Optimal Design for Longitudinal and Multilevel Research:

Documentation for the “Optimal Design” Software

Jessaca Spybrook

University of Michigan

Stephen W. Raudenbush

University of Chicago

Xiao-feng Liu

University of South Carolina

Richard Congdon

Harvard University

Andrés Martínez

University of Michigan

Applies to Optimal Design Version 1.76

Last Revised on March 12, 2008

Preface

The Optimal Design software, developed with support from the National Institute

of Mental Health and the William T. Grant Foundation, now contains modules that can

assist researchers in planning single level trials, cluster randomized trials, multi-site

randomized trials, multi-site-cluster randomized trials, cluster randomized trials with

treatment at level three, trials with repeated measures, and cluster randomized trials with

repeated measures.

We regard this version of the software as a “beta version,” meaning that we

distribute it for use under the condition that those who use it are asked to promptly report

difficulties or errors to Andres Martinez ([email protected]), Stephen W. Raudenbush

([email protected]) and/or Jessaca Spybrook ([email protected]). We will

attempt to make needed changes quickly. This documentation will also be revised based

on reviewers’ comments.


Table of Contents

1. Cluster Randomized Trials

2. Including a Cluster-Level Covariate in a Cluster Randomized Trial

3. Using the Optimal Design Software for Cluster Randomized Trials

4. Cluster Randomized Trials with Binary Outcomes

5. Using the Optimal Design Software for Cluster Randomized Trials with Binary Outcomes

6. Multi-site Cluster Randomized Trials

7. Using the Optimal Design Software for Multi-site Cluster Randomized Trials

8. Three Level Models with Randomization at Level Three

9. Using the Optimal Design Software for the Three Level Model with Treatment at Level Three

10. Repeated Measures in Cluster Randomized Trials

References


1. Cluster Randomized Trials

Cluster randomized trials have become a popular design choice in social science

research. These trials rely on the assignment of clusters to treatments. For example,

assume there are 40 schools in an experiment. In a cluster randomized trial, 20 schools

may be assigned to the experimental treatment, a new math series, and 20 schools may be

assigned to the control, the regular math series. Note that unlike typical designs,

individuals are not randomly assigned to treatment or control, but rather clusters or

groups of individuals are assigned. Readers who are familiar with hierarchical linear

models can think of this as a two level design, students nested within schools

(Raudenbush and Bryk 2002). Here, students are the level-one units and schools are the

level-two units. The treatment contrast is defined at level two.

The first three chapters in this manual provide researchers with a guide to

effectively designing a cluster randomized trial for a continuous outcome. The first

chapter provides an overview of key statistical terms and background information

relating to cluster randomized trials. The second chapter introduces the concept of a

cluster-level covariate to cluster randomized trials. Chapter 3 describes how to use the

Optimal Design software to design a cluster randomized trial with and without a cluster-

level covariate.

1.1 Components of a Cluster Randomized Trial

In a cluster randomized trial our primary goals are to estimate the difference

between treatments and to determine if their difference is statistically significant. For

example, in the case of the new math series, we might want to determine if there is a

difference in mean math achievement between schools that implement the new series and

schools that use the regular series. Typically, math achievement is measured by a test, so

we might look to see if the students experiencing the new math series scored significantly

higher on average than the students experiencing the regular math series on an

appropriate test given to both groups. To determine if there is a significant difference


between the two group means, we must have adequate statistical power. In a completely

balanced cluster randomized trial, the power to detect a difference between the two

groups, or the main effect of treatment, depends on the cluster size (n), the number of

clusters (J), the intra-class correlation (ρ), and the effect size (δ). The remainder of this

chapter examines each of the components of a cluster randomized trial and how they

affect the power of the study.

1.1.1 Statistical Power

Power is the probability of rejecting the null hypothesis when a specific

alternative hypothesis is true. In a study comparing two groups, power is the chance of

rejecting the null hypothesis that the two groups share a common population mean and

therefore claiming that there is a difference between the population means of the two

groups, when in fact there is a difference of a given magnitude. It is thus the chance of

making the correct decision, that the two groups are different from each other. Power is

linked to discussions of hypothesis testing and significance levels so it is important to

have a clear definition of each of these terms before proceeding. Note that in a perfectly

implemented randomized experiment with correctly analyzed data, power is the

probability of discovering a causal effect of treatment when such an effect truly exists.

In hypothesis testing, there are two hypotheses, a null hypothesis and an

alternative hypothesis. In a two-treatment design, the most common null hypothesis states

that there is no significant difference between the population mean for the treatment

group and the control group. The alternative hypothesis states that there is a difference

between groups. The difference may be expressed as a positive treatment effect, a

negative treatment effect, or simply that the treatment mean is not equal to the control

mean. For example, in the case of the new math series, the null hypothesis states that on

average, math achievement will be the same for students using the regular math series

(control group) and students using the new math series (experimental group). However,

the researchers believe that the new math series is better than the regular series. In this

case, the alternative hypothesis states that average math achievement for the experimental


group is higher than that of the control group. Thus the alternative hypothesis states that

there is a positive treatment effect. After the hypotheses are clearly stated and the data

has been collected and analyzed, the researcher must decide if there is sufficient evidence

to reject the null hypothesis.

The significance level, often denoted α, is the probability of rejecting the null

hypothesis when it is true. This is known as a Type I error rate. A Type I error occurs

when the researcher finds a significant difference between two groups that do not, in fact,

differ. In the math example, a Type I error would occur if we conclude that students using

the new math series scored higher, on average, than the control group when in fact there

is no difference between the two groups. Typically, alpha is set at 0.05 so that, when the

null hypothesis is true, there is only a 5% chance of making this type of mistake.

Suppose, however, that the null hypothesis is indeed false. A Type II error arises

when we mistakenly retain the null hypothesis. The probability of retaining a false null

hypothesis, often denoted β, is therefore the Type II error rate. In the math example, a

Type II error occurs if a researcher concludes that, on average, math achievement for the

two groups is the same when in fact students using the new math series achieve higher

than students using the regular math series. In this case, the researcher overlooks a

significant difference. The two types of errors are illustrated in Table 1.

Table 1: Possible errors in hypothesis testing.

                             Do Not Reject the Null Hypothesis     Reject the Null Hypothesis
Null Hypothesis is True      No Error (Probability = 1 − α)        Type I Error (Probability = α)
Null Hypothesis is False     Type II Error (Probability = β)       No Error (Probability = 1 − β)


If the null hypothesis is true (first row of Table 1), the correct decision is to retain the null, and the probability of this correct decision is Probability(Retain H0 | H0 is true) = 1 − α. With α = 0.05, for example, the probability is 0.95 that we will make the correct decision of retaining H0 when it is true. The incorrect decision in this case is the Type I error, rejecting the true H0. When H0 is true, this error will occur with probability α = 0.05.

On the other hand, if the null hypothesis is false (second row of Table 1), the correct decision is to reject it. The probability of making this correct decision is the power = Probability(Reject H0 | H0 is false) = 1 − β. The incorrect decision, known as the Type II error, occurs with probability β; that is, Prob(Type II error | H0 false) = β.

Looking at the results of a study retrospectively, we know that a researcher who has retained H0 (column 1 of Table 1) has either made a correct decision or committed a Type II error. In contrast, a researcher who has rejected H0 (column 2) has either made a correct decision or committed a Type I error. Note that it is logically impossible for a researcher who has rejected H0 to have made a Type II error. To criticize such a researcher for designing a study with low power would be a vacuous criticism, since a lack of power cannot account for a decision to reject H0. However, a researcher who retains the null hypothesis may have committed a Type II error and is therefore potentially vulnerable to the criticism that the study lacked power. Indeed, low-power studies in which H0 is retained are virtually impossible to interpret. One cannot claim a new treatment to be ineffective in a study having low power because, by definition, such a study would have little chance of detecting a true difference between the two populations represented in the study.

Although Type I and Type II errors are mutually exclusive, the choice of α can affect power. Suppose a researcher, worried about committing a Type I error, sets a lower α, say α = 0.001. If the null hypothesis is true, this researcher will indeed be protected against a Type I error. However, suppose H0 is false. Setting α very low will reduce power, which is equivalent to increasing β, the probability of a Type II error. While keeping in mind that the choice of α affects power, we will for simplicity assume α = 0.05 in the remainder of this discussion in order to focus on sample size as a key determinant of power.

Of course, neither type of error is desirable and we would prefer to make the

correct decision. As a result, we want the probability of correctly detecting a difference,

that is the power, to be large. Think again about the math example. Assuming the new

math series works better than the control series, we want high power to detect a

difference between the group using the new math series and the group using the regular

math series. In other words, assuming the new curriculum is effective, we seek a high

probability of rejecting the null hypothesis and concluding that, on average, students

using the new math series have higher math achievement. For example, if the power is

0.80, we will correctly identify a difference between the groups with probability 0.80.

Power greater than or equal to 0.80 is often recognized by the research community to be

sufficient, though some researchers seek 0.90 as a minimum.

In a cluster randomized trial, the power of a test is a function of the cluster size, n, the number of clusters, J, the intra-class correlation, ρ, and the effect size, δ, holding α constant. As we shall see, given ρ and δ, the power in cluster randomized trials is dominated by the number of clusters, not the number of subjects within a cluster. Therefore, to increase the power, we generally want to increase the number of clusters. However, increasing the number of clusters may be far more expensive than adding subjects within a cluster, which can be problematic since studies operate on fixed budgets.

Consider the math example. Once a new math program is implemented within a

school, it is relatively inexpensive to test more students and include them in the sample.

Adding more clusters, or schools, is much more expensive. Adding a new school requires

securing an agreement with school leaders to participate, training additional teachers in


the new program, buying the necessary supplies for a school, and paying for data

collectors to travel to the school. This can be very costly, and it may not be feasible to

include a large number of schools.

In addition to sample size, the desired effect size and intra-class correlation

coefficient also contribute to the power of the test. Larger effect sizes produce higher

power. Smaller values of the intra-class correlation coefficient, which measures the

fraction of variation lying between schools, also increase power. However, the researcher

does not have as much control over these quantities as they are strongly determined by

the phenomenon under investigation. Let’s take a closer look at the model to see how n,

J, ρ , and δ influence power.

1.1.2 The Model

We can write the model for a cluster randomized trial in hierarchical form, with

individuals nested within clusters. The level-1, or person-level model is:

Y_ij = β_0j + e_ij,   e_ij ~ N(0, σ²)   (1)

for persons i ∈ {1, 2, ..., n} per cluster and clusters j ∈ {1, 2, ..., J},

where Y_ij is the outcome for person i in cluster j;

β_0j is the mean for cluster j;

e_ij is the error associated with each person; and

σ² is the within-cluster variance.

The level-2 model, or cluster-level model, is:

β_0j = γ_00 + γ_01·W_j + u_0j,   u_0j ~ N(0, τ)   (2)

where γ_00 is the grand mean;

γ_01 is the mean difference between the treatment and control group, or the main effect of treatment;

W_j is the treatment contrast indicator, ½ for treatment and -½ for control;

u_0j is the random effect associated with each cluster; and

τ is the variance between clusters.

Replacing (2) in (1) yields the mixed model:

Y_ij = γ_00 + γ_01·W_j + u_0j + e_ij,   u_0j ~ N(0, τ) and e_ij ~ N(0, σ²).   (3)

We are interested in the main effect of treatment, γ_01, estimated by:

γ̂_01 = Ȳ_E − Ȳ_C,   (4)

where Ȳ_E is the mean for the experimental group and Ȳ_C is the mean for the control group. When each treatment has an equal number, J/2, of clusters, the variance of the estimated main effect of treatment is:

Var(γ̂_01) = 4(τ + σ²/n)/J   (5)

where n is the number of participants per cluster and J is the total number of clusters.
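The variance formula in Equation 5 can be checked by simulating the two-level model directly. The sketch below is a minimal Monte Carlo check in Python; the values chosen for γ_00, γ_01, τ, σ², n, and J are illustrative, not taken from the text.

```python
import random
import statistics

random.seed(1)

def one_trial(n=20, J=40, gamma00=0.0, gamma01=0.3, tau=0.05, sigma2=0.95):
    """Simulate one balanced cluster randomized trial (Equations 1-3)
    and return the estimated main effect of treatment (Equation 4)."""
    cluster_means = []
    for j in range(J):
        W = 0.5 if j < J // 2 else -0.5              # treatment contrast indicator
        u0 = random.gauss(0, tau ** 0.5)             # cluster random effect, u_0j
        m = statistics.mean(
            gamma00 + gamma01 * W + u0 + random.gauss(0, sigma2 ** 0.5)
            for _ in range(n)                        # person-level errors, e_ij
        )
        cluster_means.append((W, m))
    yE = statistics.mean(m for W, m in cluster_means if W > 0)
    yC = statistics.mean(m for W, m in cluster_means if W < 0)
    return yE - yC

estimates = [one_trial() for _ in range(1000)]
empirical_var = statistics.variance(estimates)
theoretical_var = 4 * (0.05 + 0.95 / 20) / 40        # Equation 5: 4(tau + sigma^2/n)/J
# empirical_var and theoretical_var should agree to within Monte Carlo error
```

Across the 1000 replications, the variance of the simulated γ̂_01 values matches Equation 5 closely, which is a useful sanity check before trusting the power formulas built on it.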

1.1.3 Testing the Main Effect of Treatment

We can use hypothesis testing to determine if the main effect of treatment is

“statistically significant,” that is, not readily attributable to chance. Recall that a two-

tailed null hypothesis states there is no difference whereas the alternative hypothesis

states there is a difference. In symbols:

H0: γ_01 = 0

H1: γ_01 ≠ 0

If the data are balanced, that is, there is an equal number of participants in each cluster,

we can use the results of a two factor nested ANOVA to test the main effect of

treatment.1 The test statistic is an F statistic, which compares treatment variance to

cluster variance. The F statistic is defined as:

1 This is the same result we would obtain using a two-level hierarchical linear model (Equations 1 and 2) estimated by means of restricted maximum likelihood.


F = MS(treatment)/MS(cluster)   (6)

Note that the F statistic converges to the ratio of expected mean squares, which is defined as:

E(MS_treatment)/E(MS_cluster) = (σ² + nτ + nJγ_01²/4)/(σ² + nτ) = 1 + (nJγ_01²/4)/(σ² + nτ)   (7)

and can be rewritten as:

E(MS_treatment)/E(MS_cluster) = 1 + λ,   where λ = (nJγ_01²/4)/(σ² + nτ).   (8)

If the null hypothesis is true, the F statistic follows a central F distribution with 1 degree of freedom for the numerator and J-2 degrees of freedom for the denominator. Under the central F distribution, we would expect the F statistic to be approximately 1. In other words, there is no variation between treatments, so γ_01 ≈ 0 and the term nJγ_01²/4 in the numerator of the expected mean square ratio goes toward 0. We see that if λ = 0, the ratio of expected mean squares reduces to:

E(MS_treatment)/E(MS_cluster) = (σ² + nτ)/(σ² + nτ) = 1.

If the null hypothesis is false so that there is a treatment difference, that is γ_01 ≠ 0, the F statistic follows a non-central F distribution with 1 degree of freedom for the numerator and J-2 degrees of freedom for the denominator. The non-central F distribution is characterized by the non-centrality parameter, λ (see Equation 8). λ can be rewritten as:

λ = γ_01²/[4(τ + σ²/n)/J]   (9)

Note that λ, the non-centrality parameter, is the ratio of the squared main effect of treatment to the variance of its estimate. Equation 9 clearly shows that λ is a function of γ_01, n, J, τ, and σ².

The non-centrality parameter is strongly related to the power of the test. As

λ increases, the power increases. Let’s see what makes λ increase. Increasing the

treatment effect increases λ . Thus, if we are trying to detect a larger difference in means,

λ increases and so the power also increases. Note that the denominator is identical to the


variance of the treatment effect (Equation 5). So to increase λ we could decrease the variance of the main effect of treatment. Because the standard error of the treatment effect is more commonly discussed, instead of referring to the variance of the main effect of treatment we will refer to its standard error, which is simply:

SE(γ̂_01) = √[4(τ + σ²/n)/J]   (10)

Notice that increasing n and J will decrease the standard error, thus increasing the power. Also, decreasing τ and σ² will decrease the standard error and increase the power. The remainder of this chapter explores how n, J, τ, and σ² affect the power of the test.

1.1.4 Cluster Size, n

The cluster size, n, refers to the number of participants in each cluster. In the

school example, n is the number of students in the new math series group (experimental

group) or the number of students in the regular math series group (control group). In

general, increasing n decreases the standard error of the treatment effect (equation 10)

thus increasing the power. However, at some point, increasing n without increasing the

number of clusters, J, provides no further benefit. As n → ∞, Equation 10 reduces to SE(γ̂_01) = 2√(τ/J), which is zero only if τ = 0.
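This limiting behavior is easy to verify numerically. A small sketch, assuming the illustrative values τ = 0.05, σ² = 0.95, and J = 20 (these particular numbers are our own, not from the text):

```python
import math

tau, sigma2, J = 0.05, 0.95, 20

def se_main_effect(n):
    """Standard error of the main effect of treatment, Equation 10."""
    return math.sqrt(4 * (tau + sigma2 / n) / J)

floor = 2 * math.sqrt(tau / J)   # the limit of Equation 10 as n -> infinity

# se_main_effect(10) ≈ 0.170, se_main_effect(100) ≈ 0.109, floor = 0.100:
# past a certain cluster size, adding people barely moves the standard error.
```

Going from n = 10 to n = 100 buys only a modest reduction in the standard error, and no value of n can push it below the floor of 2√(τ/J).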

1.1.5 Number of Clusters, J

As the total number of clusters, J, increases, the power to detect significant

differences also increases. As mentioned earlier, the number of clusters has a stronger

influence on power than the cluster size. As J increases towards infinity, the power

approaches 1 regardless of n. This is because as J increases towards infinity, the standard

error (10) gets infinitely small. This causes the non-centrality parameter to increase

towards infinity, which results in the power approaching 1. Intuitively this makes us think

that we should just continue to increase J until the desired power is achieved. However,


increasing J or adding additional clusters may not be feasible due to budgetary

constraints. Choosing the optimal sample size with a fixed budget is discussed more

thoroughly in section 1.2.

1.1.6 Intra-Class Correlation, ρ

In addition to the number of clusters, the variability between clusters also affects power. This variability is quantified by the intra-class correlation coefficient, ρ, the ratio of the between-cluster variability to the total variability:

ρ = τ/(τ + σ²)   (11)

where τ is the variation between clusters;

σ² is the variation within clusters; and

τ + σ² is the total variation.

For US data sets on school achievement, ρ typically ranges between 0.05 and 0.15. In neighborhood research on mental health, ρ will generally be smaller. Because τ + σ² is the total variation, we can constrain it to be 1. Algebraic manipulation of the formula then reveals τ = ρ and σ² = 1 − ρ. As ρ increases, we know more of the variation is due to between-cluster variability. Replacing τ and σ² with ρ and 1 − ρ in the standard error formula (Equation 10), the standard error of the main effect of treatment can be rewritten as:

SE(γ̂_01) = √[4(ρ + (1 − ρ)/n)/J]   (12)

From equation 12, we can see that increased values of ρ increase the standard error thus

decreasing the power. Also, as ρ increases, the effect of n decreases. Therefore, if there is

a lot of variability between clusters, we gain more power by increasing the number of

clusters sampled. The key idea for ρ is that power increases as ρ decreases for a fixed n

and J.
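To put Equation 12 in numbers, the sketch below evaluates the standard error at several values of ρ, holding n = 50 and J = 56 fixed (sample sizes borrowed from the example in Section 1.1.8; the pairing here is our own illustration):

```python
import math

def se_rho(rho, n=50, J=56):
    """Standard error of the main effect of treatment, Equation 12."""
    return math.sqrt(4 * (rho + (1 - rho) / n) / J)

# se_rho(0.05) ≈ 0.070, se_rho(0.10) ≈ 0.092, se_rho(0.15) ≈ 0.109:
# the standard error grows, and power falls, as rho increases.
```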


1.1.7 Standardized Effect Size,δ

The treatment effect is the difference between the population means of the two groups. However, because data for different experiments are collected on different scales, standardizing the effect is important so that the results are meaningful to any researcher, not just someone who is familiar with a particular data set. The standardized effect size, δ, is the difference in population means between the two groups divided by the standard deviation of the outcome:

δ = γ_01/√(τ + σ²)   (13)

where γ_01 = μ_E − μ_C;

μ_E is the population mean for the experimental group; and

μ_C is the population mean for the control group.

Given σ² and τ, the standardized effect size, δ, is estimated by:

δ̂ = (ȳ_E − ȳ_C)/√(τ + σ²)   (14)

and

SE(δ̂) = √[4(ρ + (1 − ρ)/n)/J]   (15)

Standardized effect sizes between 0.50 and 0.80 are considered large, and effect sizes as

small as 0.20 to 0.30 are often considered worth detecting. Note that larger effect sizes

are easier to detect. The interpretation of a given δ as "large" or "small" is, however, sensitive to the research setting and to the capacity of researchers to implement powerful treatments and to measure outcomes with high validity.

Prior to calculating the actual effect size from the sample, the researcher must

specify a desired minimum effect size to calculate the power of the test. Recall that the

power of the test is driven by the non-centrality parameter, λ (equation 9). We can

redefine λ in terms of the standardized effect size as shown below:


λ = nJδ²/[4(nρ + 1 − ρ)] = δ²/[4(ρ + (1 − ρ)/n)/J]   (16)

Now we can calculate the power of the test knowing only n, J, δ, and ρ. The Optimal Design software uses this standardized notation.
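Equation 16 together with the non-central F distribution is how the Optimal Design software computes power. As a rough standard-library stand-in, the sketch below uses the large-sample normal approximation instead of the non-central F, so its answers can differ slightly from OD's, especially for small J; the function name and defaults are our own.

```python
import math
from statistics import NormalDist

def crt_power(n, J, delta, rho, alpha=0.05):
    """Approximate two-tailed power for a balanced cluster randomized trial.

    Normal approximation: power ~ Phi(delta/SE - z_crit) + Phi(-delta/SE - z_crit),
    with SE from Equation 15. The OD software uses the non-central F with
    J - 2 denominator degrees of freedom, which gives slightly lower power.
    """
    nd = NormalDist()
    se = math.sqrt(4 * (rho + (1 - rho) / n) / J)   # Equation 15
    z_crit = nd.inv_cdf(1 - alpha / 2)
    z = delta / se
    return nd.cdf(z - z_crit) + nd.cdf(-z - z_crit)

# With n = 50, delta = 0.20, rho = 0.05, and J = 56 (the example that follows),
# this gives roughly 0.81, close to the 0.80 read off the software's plot.
```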

1.1.8 Example

Let’s take a look at an example to see how the various components work together

to affect the power of a test. Suppose a new literacy program has been developed. The

founders of the new program propose that students who participate in the program will

have increased reading achievement. They decide to conduct a cluster randomized trial.

Based on past studies, they estimate ρ as 0.05, meaning that 5% of the total variation in

the outcome lies between clusters. The researchers want to be able to detect a minimum

effect size of 0.20, or 20% of a standard deviation. Note this is a small effect size.

Assume they have 50 students in each cluster. How many clusters are necessary to

achieve power = 0.80?

Let’s take a look at Figure 1, produced with the OD software. The graph shows

the power on the y-axis varying as a function of the number of clusters on the x-axis,

while holding constant ρ = 0.05, δ = 0.20, and n = 50.


Figure 1: CRT - Power vs. Number of Clusters. [Power plotted against the number of clusters; α = 0.050, n = 50, δ = 0.20, ρ = 0.05.]

We can see that as J increases, the power increases towards 1.0. Clicking on the graph where power = 0.80 reveals J = 56. This means a total of 56 clusters, 28 per treatment, are necessary to achieve power = 0.80 when ρ = 0.05, δ = 0.20, and n = 50.

Let's see how the graph would change if the expected effect size is increased to 0.40 while holding ρ = 0.05 and n = 50. Since this is a larger effect size, we would expect to achieve power = 0.80 with fewer clusters. Figure 2 displays both graphs.


Figure 2: CRT - Power vs. Number of Clusters. [Power curves for δ = 0.20 and δ = 0.40; α = 0.050, n = 50, ρ = 0.05.]

Looking at the two graphs, we can see that if the effect size is 0.40, fewer clusters are
needed to achieve power of 0.80. Clicking on the trajectory reveals that 16 clusters, 8 per

treatment, are necessary to achieve power = 0.80. Recall 56 clusters were necessary for

an effect size of 0.20 so this is a big reduction.

Let's see how the graph would change for different values of ρ. Assume that two values of ρ based on past studies are 0.05 and 0.10. We expect the power to decrease as ρ increases for a fixed sample size. Let's take a look at Figure 3.


Figure 3: CRT - Power vs. Number of Clusters. [Power curves for δ = 0.20 and δ = 0.40 crossed with ρ = 0.05 and ρ = 0.10; α = 0.050, n = 50.]

For both effect sizes, the larger value of ρ increases the number of clusters necessary to achieve power = 0.80. Though the increase in ρ may seem small, to achieve power = 0.80 with δ = 0.20 and ρ = 0.10, the number of clusters necessary jumps from 56 to 96.
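Numbers like the jump from 56 to 96 clusters can be approximated with a direct search over J. A sketch under the same normal approximation as before; the helper below, including its name and search range, is our own construction, and OD's non-central-F computation can differ by a few clusters.

```python
import math
from statistics import NormalDist

def clusters_needed(n, delta, rho, power=0.80, alpha=0.05):
    """Smallest even J whose approximate two-tailed power reaches the target."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    for J in range(4, 2000, 2):                         # even J keeps the design balanced
        se = math.sqrt(4 * (rho + (1 - rho) / n) / J)   # Equation 15
        if nd.cdf(delta / se - z_crit) >= power:
            return J
    return None

# clusters_needed(50, 0.20, 0.05) -> 56, matching the example;
# raising rho to 0.10 pushes the requirement into the nineties.
```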

Let’s see how things change if we allow the cluster size to vary and fix the

number of clusters. The graph in Figure 4 allows n to vary along the x-axis and shows the

corresponding power along the y-axis for fixed ρ, δ, and J.


Figure 4: CRT - Power vs. Number of Subjects per Cluster. [Power curves for δ = 0.20 and δ = 0.40 crossed with ρ = 0.05 and ρ = 0.10; α = 0.050, J = 20.]

As we can see in the graphs, increasing n does not make the power increase

towards 1. When the number of clusters was allowed to increase, the power increased

much more rapidly towards 1, which shows that the number of clusters is more influential

than the cluster size in increasing power. Thinking back to our standard error formula,

this is exactly as we would expect. Recall the standard error formula below:

SE(Main Effect of Treatment)J

n)/)1((4 ρρ −+=

As J increases toward infinity, the standard error get infinitely small thus the power will

increase towards 1. However, as n gets larger, the standard error does not get infinitely

small. Instead the standard error of the main effect of treatment reduces to:

SE(Main Effect of Treatment) = Jρ4 as approaches n ∞

Thus it is clear that increasing J has a greater affect on the power than increasing n under

.0=ρ


This example makes it clear that there are many things to consider when planning

a cluster randomized trial. The OD software is a tool for helping the researcher design a

study with the appropriate number of subjects, clusters, and adequate power. Details

regarding how to produce the figures in the example are discussed in Chapter 3.

1.2 Optimal Sample Allocation

As described in Section 1.1, power depends on the within-cluster sample size, n, the number of clusters, J, the intra-class correlation, ρ, and the desired effect size, δ. ρ and δ are typically estimated by the researcher based on prior knowledge and similar studies. This leaves the sample size components, n and J, for the researcher to specify.

It is a common belief that increasing n will increase the power. However, as we saw earlier, increasing n only increases power up to a point. In cluster randomized designs, power is more strongly affected by increasing J than by increasing n. This may suggest that the best thing to do is simply to make J very large. However, as previously discussed, adding more clusters is often expensive, usually costing more than adding people within a cluster. Because many studies are on a limited budget, it is important to find the optimal allocation of n and J for a fixed budget.

The total variable cost of data collection can often be reasonably approximated by

the formula below:

T = J(n·C1 + C2)   (17)

where J = number of clusters;

n = number of participants within a cluster;

C1 = cost per participant;

C2 = cost per cluster; and

T = total cost.


To calculate the optimal sample size, first find the optimal n and then find the corresponding J. The optimal n is that which minimizes the variance of the treatment effect. Recall that the variance of the main effect of treatment, defined in Equation 5, is:

Var(γ̂_01) = 4(τ + σ²/n)/J.

Substituting J = T/(n·C1 + C2) (a simple rearrangement of the cost equation) and minimizing the resulting expression with respect to n, we obtain the formula for the optimal n:

n_opt = (σ/√τ)·√(C2/C1)   (18)

where σ is the within-cluster standard deviation;

√τ is the between-cluster standard deviation;

C1 is the cost per person; and

C2 is the cost per cluster.

From the formula, we can see that as the within-cluster variance increases relative to the between-cluster variance, the optimal n increases. Intuitively this makes sense. If there is large variation within clusters, we would want to sample more people in each cluster to represent that variation. However, if the within-cluster variation is very small, the optimal n decreases. In this case, we want fewer people in each cluster because most of the variation is between clusters, so adding more people within clusters will not be very helpful. In terms of the cost ratio, if the cost per cluster becomes increasingly larger than the cost per person, we are penalized for adding clusters and the optimal n increases. After the optimal n is found, the number of clusters can be calculated by plugging n back into the formula for J:

J = T/(n·C1 + C2)   (19)

The cost per cluster and cost per person may be the same in the control and experimental groups, or they may differ. The remainder of this chapter looks at optimal sample allocation when the costs of sampling the two groups are equal and when they are not.


1.2.1 Equal Costs

The simplest case is when the sampling costs are the same for the treatment and

control groups. The following example illustrates how to calculate the optimal n and the

resulting J to minimize the variance for a fixed budget.

A researcher wants to determine the effect of a new drug prevention program in

schools. The total budget for sampling costs is $10,000. The cost per cluster (C2) is $400

and cost per person (C1) is $20. The estimated intra-class correlation coefficient is 0.05.

What is the optimal n? How many clusters will be in the study? Using formulas 18 and

19 described above, the optimal n and J can be computed by hand as shown below.

Step 1: Set τ + σ² = 1, so τ = ρ and σ² = 1 − ρ. For this example, τ = .05 and σ² = .95.

Step 2: Calculate the standard deviations: √τ = .2236 and √σ² = .9747.

Step 3: Find the cost ratio C2/C1 = 400/20 = 20.

Step 4: Set up the equation nopt = (.9747/.2236)√20 ≈ 20.

Plugging n = 20 into J = T/(nC1 + C2) yields J = 12.5, which is rounded down to 12 in order to

stay within budget. The value of the variance of the treatment effect can also be

calculated by plugging in n and J to the variance equation.
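For readers who prefer to script the calculation, the hand computation above can be reproduced in a few lines of Python (an illustrative sketch, not part of the Optimal Design software; the function and variable names are our own):

```python
import math

def optimal_n(sigma, tau, c1, c2):
    # Equation 18: n_opt = (sigma / tau) * sqrt(C2 / C1), where sigma and tau
    # are the within- and between-cluster standard deviations.
    return (sigma / tau) * math.sqrt(c2 / c1)

def clusters_for_budget(total, n, c1, c2):
    # Equation 19: J = T / (n*C1 + C2), rounded down to stay within budget.
    return math.floor(total / (n * c1 + c2))

def var_treatment_effect(tau2, sigma2, n, j):
    # Equation 5: Var = 4 * (tau + sigma^2 / n) / J, with variances as inputs.
    return 4 * (tau2 + sigma2 / n) / j

rho = 0.05                        # intra-class correlation; tau + sigma^2 = 1
tau2, sigma2 = rho, 1 - rho       # between- and within-cluster variances
n_star = optimal_n(math.sqrt(sigma2), math.sqrt(tau2), c1=20, c2=400)
n = 20                            # n_star is about 19.5; the text rounds to 20
j = clusters_for_budget(10000, n, c1=20, c2=400)
print(n_star, j, var_treatment_effect(tau2, sigma2, n, j))
```

Running the sketch gives nopt ≈ 19.5 (rounded to 20), J = 12, and a minimized variance of about 0.0325, matching the hand calculation.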

The Optimal Design software can be used to do these calculations. The software

produces a plot as shown below:


Figure 5: CRT - Optimal n vs. rho (α = 0.050, C2/C1 = 20.000; x-axis: intra-class

correlation; y-axis: optimal n)

The plot allows the researcher to see how the optimal n changes with respect to

the intra-class correlation coefficient. Notice that as ρ increases, optimal n decreases. In

other words, if there is large between-cluster variance then it is not very helpful to

increase the number of people per cluster and more money should be spent trying to

increase the number of clusters.

Notice that in the previous example there were no power calculations or set effect

sizes. If the desired effect size is specified, then the Optimal Design software can be used

to calculate the optimal n and J that maximize power. For example, recall in the example

above that: T=$10,000, C2=$400, C1=$20, and ρ = 0.05. Imagine that the desired effect

size is 0.40. Plugging these values into the OD software, which solves for n and J to

maximize the power, reveals an optimal n = 18, J = 13, and power = 0.53. Knowing that

the power is only 0.53 and acceptable power levels are typically 0.80 or higher, the

researcher may need to try to increase the budget in order to achieve higher power.
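The search that the software performs can be mimicked by brute force (an illustrative Python sketch, not OD code; it uses a large-sample normal approximation to the non-central F power, so the power it reports for the winning design is somewhat higher than OD's exact 0.53):

```python
import math

def phi(z):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def approx_power(delta, rho, n, j, z_crit=1.959964):
    # Non-centrality lambda = delta^2 * J / (4 * (rho + (1 - rho) / n));
    # power is approximated by Phi(sqrt(lambda) - z_crit).
    lam = delta ** 2 * j / (4 * (rho + (1 - rho) / n))
    return phi(math.sqrt(lam) - z_crit)

T, c1, c2, rho, delta = 10000, 20, 400, 0.05, 0.40
best = max(((n, math.floor(T / (n * c1 + c2))) for n in range(2, 61)),
           key=lambda nj: approx_power(delta, rho, nj[0], nj[1]))
print(best)   # the affordable (n, J) pair with the highest approximate power
```

The search lands on n = 18 and J = 13, the same allocation OD reports; the approximate power (about 0.61) overstates OD's exact 0.53 because the normal approximation ignores the small number of denominator degrees of freedom.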


1.2.2 Unequal Costs

If the cost of sampling persons or clusters varies across the treatment groups, the

optimal design will not be balanced, even assuming variances to be the same in each

treatment group. The optimal cluster size and/or the optimal number of clusters will be

different as a function of these cost differences. However, the current version of the

Optimal Design software does not provide optimal allocation formulas in this setting.


2. Including a Cluster Level Covariate in a Cluster Randomized Trial

A common problem facing researchers designing cluster randomized trials is that

the cost of a study frequently limits the number of clusters, resulting in a lack of

statistical power. One method to combat this problem is to include a covariate in the

design and analysis of a cluster randomized trial. Including a covariate may reduce the

number of clusters necessary to achieve a specified level of power. This chapter provides

a brief conceptual background for including a cluster level covariate in a cluster

randomized trial.

2.1 Why include a cluster level covariate?

To illustrate, let’s consider another new math program. Suppose the goal of the

study is to determine if a new 2nd grade math series is superior to the standard math

series. Let’s assume that ρ = 0.10 and the researcher seeks to detect a minimum effect

size of 0.20. Assume there are 50 2nd grade students from each school and the researcher

has secured 50 schools, 25 in the treatment group and 25 in the control group. Due to

budgetary constraints, 50 schools and 50 2nd grade subjects within each school is the

maximum number available to the researcher. Entering ρ, δ, n, and J into the cluster

randomized trial option in the Optimal Design software reveals that the researcher only

has power = 0.52. The low power makes it difficult for the researcher to detect the

expected effect. Thus, an important effect may well go undetected. Including a covariate

in the design and analysis may greatly increase the power.

In this chapter, we focus specifically on including a covariate at the cluster level.

This may be an aggregated covariate, such as pre-test scores aggregated across schools or

school SES. Recall that power in a cluster randomized trial is a function of the minimum

detectable effect size, ,δ the intra-class correlation, ,ρ the number of clusters, J, and the

cluster size, n, while holding α constant. When we include a covariate in the design, there

is an additional component that influences the power of the test: the strength of the

correlation between the covariate and the true cluster mean outcome. The strength of the


correlation between the covariate and the true cluster mean is denoted ρxβ0. We adopt

this notation because β0j is the true mean outcome for cluster j, and Xj is the covariate.

The residual level-2 variance, or unexplained variance after accounting for the covariate,

is denoted τ|x. As we will see later, the stronger the correlation ρxβ0, the smaller the

conditional level-2 variance, τ|x, compared to the unconditional level-2 variance, τ, and

the greater the benefit of the covariate in increasing precision and power. Let’s take a

closer look at the model with a cluster-level covariate.

2.2 The Model

In hierarchical form, the level-1 model for a cluster randomized trial with a

cluster-level covariate is the same as the level-1 model in Chapter 1

Yij = β0j + eij,  eij ~ N(0, σ²) (1)

for i ∈ {1, 2, …, n} persons per cluster and j ∈ {1, 2, …, J} clusters,

where β0j is the mean for cluster j;

eij is the error associated with person i in cluster j; and

σ² is the within-cluster variance.

The level-2 model, or cluster-level model differs from a simple cluster randomized trial

because it includes a term for the cluster-level covariate. The model is:

β0j = γ00 + γ01Wj + γ02Xj + u0j,  u0j ~ N(0, τ|x) (2)

where γ00 is the grand mean;

γ01 is the mean difference between the treatment and control group, or the main

effect of treatment;

γ02 is the regression coefficient for the cluster-level covariate;

Wj is the treatment contrast indicator, ½ for treatment and -½ for control;

Xj is the cluster-level covariate, centered around its group mean;

u0j is the random effect associated with each cluster; and

τ|x is the residual variance between clusters.

Note that the between-cluster variance, τ|x, is now the residual variance

conditional on the cluster level covariate X. For the purposes of this paper, we assume

there is no interaction between the cluster level covariate, X, and the treatment group, W.

This is an assumption that can be relaxed and in general should be checked, given that a

researcher may be interested in how the treatment effect varies at different levels of the

covariate.

Similar to the cluster randomized trial without a covariate, we are interested in the

main effect of treatment, or the difference between the treatment average and control

average adjusting for the covariate. However, now it is estimated by:

γ̂01 = (ȲE − ȲC) − γ̂02(X̄E − X̄C) (3)

where ȲE is the mean for the experimental group;

ȲC is the mean for the control group;

X̄E is the covariate mean for the experimental group; and

X̄C is the covariate mean for the control group.

Note that the estimated main effect of treatment looks like the estimated effect

without the covariate except that here we are adjusting for treatment group differences in

the covariate. The variance of the main effect of treatment is estimated by (Raudenbush

1997):

Var(γ̂01) = [4(τ|x + σ²/n)/J][1 + 1/(J − 4)] (4)

where n is the number of subjects per cluster;

J is the total number of clusters; and

τ|x is the conditional level-2 variance, (1 − ρ²xβ0)τ.
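To see the precision gain that equation 4 implies, the adjusted and unadjusted variances can be compared directly (an illustrative sketch, not OD code, in standardized units with τ + σ² = 1; the values ρ = 0.10, n = 50, J = 50, and ρ²xβ0 = 0.5625 are illustrative):

```python
def var_unadjusted(rho, n, j):
    # Chapter 1 variance: 4 * (tau + sigma^2 / n) / J, with tau = rho, sigma^2 = 1 - rho
    return 4 * (rho + (1 - rho) / n) / j

def var_adjusted(rho, r2, n, j):
    # Equation 4: 4 * (tau_x + sigma^2 / n) / J * (1 + 1 / (J - 4)),
    # where tau_x = (1 - r2) * tau and r2 is the squared correlation rho_xb0^2
    return 4 * ((1 - r2) * rho + (1 - rho) / n) / j * (1 + 1 / (j - 4))

v0 = var_unadjusted(0.10, n=50, j=50)
v1 = var_adjusted(0.10, r2=0.5625, n=50, j=50)
print(v0, v1)   # the covariate roughly halves the variance in this case
```

Even after paying the small (1 + 1/(J − 4)) penalty, the adjusted variance here is roughly half the unadjusted one.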


2.3 Testing the Main Effect of Treatment

Similar to the case without a covariate, we can use hypothesis testing to determine

if the main effect of treatment is “statistically significant,” that is, not readily attributable

to chance. If the data are balanced, we can use the results of a nested analysis of

covariance with random effects for clusters and fixed effects for the treatment and

covariate. The test statistic is an F statistic, which compares adjusted treatment variance

to the adjusted cluster variance. The F statistic is defined as:

F = MStreatment / MSclusters (5)

where MStreatment and MSclusters are now adjusted for the covariate.

Note that the F statistic converges to the ratio of expected mean squares, defined as:

E(MStreatment) / E(MSclusters) = 1 + λx (6)

The F test follows a non-central F distribution, F(1, J − 3; λx), in the case of a cluster-level

covariate, where the non-centrality parameter, λx, is:

λx = γ01²/[4(τ|x + σ²/n)/J] (7)

and

τ|x = (1 − ρ²xβ0)τ. (8)

From equations 7 and 8, we can see that the stronger the correlation, ρxβ0, the smaller

τ|x, and the greater the increase in the power of the test.

Note that the non-centrality parameters with and without the covariate are closely

related. If the correlation between the covariate and the cluster-level mean is 0, τ|x reduces

to τ and the non-centrality parameter reduces to λ, the non-centrality parameter in the

case of no covariate. Although we are reducing the between-cluster variance, one

consequence of including a covariate is that we lose one degree of freedom. In the case of

no covariate, the F test follows a non-central F distribution, F(1, J − 2; λ), whereas in the

covariate case we have F(1, J − 3; λx). This may be a potential problem in a study with a

small number of clusters.

The non-centrality parameter can be defined in standardized notation. Recall that

in equation 7 we define the non-centrality parameter as λx = γ01²/[4(τ|x + σ²/n)/J].

Replacing τ|x = (1 − ρ²xβ0)τ, constraining τ + σ² = 1, and defining δ = γ01/√(τ + σ²), we

can rewrite λx as a function of δ, ρ, and ρxβ0, as shown below:

λx = δ²J/{4[(1 − ρ²xβ0)ρ + (1 − ρ)/n]} (9)

Note that the only difference in the non-centrality parameter in the case of the cluster-

level covariate is the correction factor, (1 − ρ²xβ0). The correction factor only affects ρ,

the between-cluster variation, since the covariate is a cluster-level covariate. As the

correlation between the covariate and the cluster-level means increases, the effective

intra-class correlation decreases. This results in an increase in the value of the non-

centrality parameter and therefore an increase in the power of the test.
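Equation 9 can be evaluated directly to see the correction factor at work (a short illustrative sketch, not OD code; with R² = 0 the expression reduces to the no-covariate λ):

```python
def lambda_x(delta, rho, n, j, r2=0.0):
    # Equation 9: lambda_x = delta^2 * J / (4 * [(1 - r2) * rho + (1 - rho) / n]),
    # where r2 is the squared correlation between covariate and cluster mean.
    return delta ** 2 * j / (4 * ((1 - r2) * rho + (1 - rho) / n))

base = lambda_x(0.20, 0.10, n=50, j=50)                 # no covariate
with_cov = lambda_x(0.20, 0.10, n=50, j=50, r2=0.5625)  # strong covariate
print(base, with_cov)   # the covariate nearly doubles the non-centrality
```

With δ = 0.20, ρ = 0.10, n = 50, and J = 50, the non-centrality rises from about 4.24 to about 8.10 when the covariate explains 56% of the level-2 variance.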

2.4 Example

Recall that in the example in section 2.1, the researchers wanted to test the

effectiveness of a new 2nd grade math series. They estimated ρ = 0.10, desired a

minimum effect size of 0.20, had 50 clusters and 50 subjects per cluster. The power to

detect an effect was only 0.52. Suppose the researchers plan to give the students a pre-test

prior to implementing the new program. Pre-test scores will be aggregated to the school

level. Based on past research, they estimate that the pre-test has a correlation of 0.75 with

the true cluster mean post-test. In other words, the cluster-level pre-test scores explain

0.75² = 0.5625, or about 56 percent, of the variation in true cluster-level post-test scores.

What is the new power that the researchers achieve when they include the covariate in the design?


To determine the power, we can use the cluster randomized trial option from the

OD software. Figure 1 below shows the trajectory with and without the covariate.

Figure 1. Power vs. number of clusters (α = 0.050, n = 50; solid: δ = 0.20, ρ = 0.10;

dotted: δ = 0.20, ρ = 0.10, R²L2 = 0.56).

The trajectory indicated by the dotted line uses the information contained in the covariate.

Clicking along the trajectory, we find that with 50 clusters the researchers can achieve

power = 0.80. In other words, the power to detect an effect increased from 0.52 to 0.80 by

including the covariate.
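The jump from 0.52 to 0.80 can be checked outside the software (an illustrative sketch, not OD code: OD evaluates the non-central F power described above exactly, while this approximation shifts a normal by an approximate t critical value, so it agrees with OD only to about two decimal places):

```python
import math

def phi(z):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def t_crit(df, z=1.959964):
    # series approximation to the two-sided 5% critical value of the t distribution
    return z + (z ** 3 + z) / (4 * df) + (5 * z ** 5 + 16 * z ** 3 + 3 * z) / (96 * df ** 2)

def approx_power(delta, rho, n, j, r2=0.0):
    # equation 9 non-centrality; J - 2 df without a covariate, J - 3 with one
    lam = delta ** 2 * j / (4 * ((1 - r2) * rho + (1 - rho) / n))
    df = j - 3 if r2 > 0 else j - 2
    return phi(math.sqrt(lam) - t_crit(df))

print(approx_power(0.20, 0.10, n=50, j=50))             # about 0.52
print(approx_power(0.20, 0.10, n=50, j=50, r2=0.5625))  # about 0.80
```

With the example's values the approximation reproduces both figures: roughly 0.52 without the covariate and roughly 0.80 with it.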

Although the use of a covariate can substantially increase power in a cluster

randomized trial, there are caveats. First, choice of the covariate must be specified during

the design phase, prior to any data analysis. A procedure that chooses the covariate by

checking model estimates for a list of possible covariates against the study data will produce

biased tests of the significance of the treatment effect. If point estimates of the treatment effect

look very different with and without adjustment of the covariate, a skeptic may suspect


opportunistic choice of the covariate. Results that are highly sensitive to choice of the

covariate inevitably arouse uncertainty. Finally, as mentioned, the researcher should

check the assumption that the covariate association with the outcome is the same across

treatment groups.


3. Using the Optimal Design Software for Cluster Randomized Trials

This chapter focuses on how to use the Optimal Design (OD) software to design a

cluster randomized trial with a continuous outcome. Section 3.1 gives some general

information regarding the software. Section 3.2 presents an example that is used to

illustrate the OD software and is explored in detail in subsequent sections. Section 3.3

explains how to use the software to design a study with budgetary constraints.

3.1 General Information

The screen shown in Figure 1 is the OD screen that appears when the software is

opened.

Figure 1. Main menu OD screen.

In this chapter, we focus on the Cluster Randomized Trial (continuous outcomes)

and Optimal Sample Allocation option. Chapter 5 discusses how to use the Optimal

Design program for the case of a cluster randomized trial with a binary outcome.

Clicking on the Cluster Randomized Trials heading displays the options listed below:


Power for main effect of treatment (continuous outcome)

Power vs. cluster size (n)

Power vs. number of clusters (J)

Power vs. intra-class cluster correlation (rho)

Power vs. effect size (delta)

Power vs. proportion of explained variation by level 2 covariate (R2)2

Power for main effect of treatment (binary outcome)

Power vs. cluster size (n)

Power vs. number of clusters (J)

Power vs. probability of success in treatment group (phi(E))

Optimal sample allocation under budgetary constraints

All of the subheadings under power for the main effect of treatment (continuous

outcome) offer the researcher an opportunity to explore one specific design element in

relation to the power of the study. Each option produces a graph with power on the y-axis

and the specified design element on the x-axis. For example, power vs. cluster size allows

the researcher to see how the power of the test changes (y-axis on the graph) as the

cluster size increases (x-axis on the graph) for fixed J, ρ , and δ . The five subheadings

under power for the main effect of treatment function similarly, so once you are familiar

with one option, the others follow easily. The option for Optimal sample allocation under

budgetary constraints allows a researcher to design a study with a fixed budget. This

option is slightly different from the previous options and is discussed in section 3.3.

Below are a few general things to keep in mind when using the OD software:

1. After clicking on any of the four power options, a new screen with a toolbar

will appear similar to the one in Figure 2:

2 Note that this differs from Version 1.0 of the program. In Version 1.0, the program asked for a covariate correlation. In Version 2.0, the program asks for the proportion of explained variation by the level 2 covariate, R2.


Figure 2. CRT - Power vs. cluster size (n) screen.

However, a graph will not appear until you click on one of the buttons on the toolbar and

click ok.

2. Once you click one option, for example, power vs. cluster size (n), you cannot

click on another option until you click on the X to close the graph.

3.2 Example

Suppose a team of researchers develops a new literacy program for 1st graders. The

founders of the new program propose that students who participate in the program will

have increased reading achievement. They plan to test students who participate in the

new program (experimental group) and students who participate in the regular program

(control group) using a reading test to determine if students using the new program score

higher. The researchers have access to last year’s 1st grade average reading test scores for

each school. Past data reveal that last year’s scores explain 49% of the variation in test

scores. The researchers want to design a cluster randomized trial with students nested

within classrooms but are not sure how to proceed. Five scenarios the researchers might

encounter are presented below. Assume α = 0.05 for each case.


3.2.1 Scenario 1 – Unknown Cluster Size (n)

Based on past studies, the researchers estimate ρ = 0.05 and want to be able to

detect a minimum effect size of 0.25. Assuming 40 classrooms (clusters) are willing to

participate in the study, how many students per classroom are necessary to achieve

power = 0.80? Find the power of the study with and without the covariate.

In Scenario 1, the cluster size, or number of students per classroom, is unknown. As a

result, we want to select the power vs. cluster size (n) option. This allows the cluster size

to vary along the x-axis. To explore the power vs. cluster size (n) option, click on it.

Figure 3 displays the screen that appears.

Figure 3. CRT - Power vs. cluster size (n) screen.

The toolbar runs across the top of the window. Let’s take a closer look at the function of

each of the buttons on the tool bar.

α - specifies the significance level, or chance of a type I error. By default α is set at 0.05,

which is a common level for most designs.

J – specifies the number of clusters. By default, J is set at 20, but it can be changed based

on the researcher’s needs.


δ - specifies the minimum effect size of interest. By default, the minimum effect size is

set at 0.20 and 0.40. Trajectories for both effect sizes are plotted so they can be

compared. The researcher is allowed to enter up to three different effect sizes.

ρ - specifies the intra-class correlation. By default, ρ is set at 0.05 and 0.10. Again two

values are specified to allow for comparisons. The researcher is allowed to enter up to

three different values of ρ.

R²L2 - specifies the proportion of the variation in the level 2 outcome that is explained by

the level 2 covariate.

<x< – sets the range of the x-axis. The x-axis displays the range of the cluster size n. By

default it is set to 2 to 60, but the researcher can change the range.

<y< - sets the range of the y-axis. The y-axis displays the range of the power. Power

ranges from 0 to 1.

Plot graph – plots the graph with all the default settings.

IEG – sets the graph legends. This allows the researcher to give specific labels and titles

to a graph.

Save – saves the graph (See Appendix A for details)

Print – prints the graph

Defs – sets the parameters to the default settings.

? – is a help option.

X – closes the window.

Note: Clicking ok after clicking on any of the buttons along the toolbar automatically

displays the graph with the default settings. Once the graph is on the screen, clicking on a

specific parameter allows you to change or add values for that parameter.

Follow the steps below to answer the question.

Step 1: Click on power vs. cluster size (n).

Step 2: Click on J on the toolbar and change J(1) to 40, the total number of clusters in

this study. Clicking ok makes the graphs appear in the window. Below is the new screen:


Figure 4. CRT - Power vs. cluster size with J=40, δ =0.20 and 0.40, ρ =0.05 and 0.10

Note there is a legend that appears in the upper right corner. This defines each of the

trajectories on the screen. This is a quick way to check if δ and ρ are defined correctly. In

our case, since we want δ =0.25, we need to change the settings.

Step 3: Click on δ on the toolbar. Notice delta(1) is set to 0.20 and delta (2) is set to 0.40,

which are the default settings. Change delta(1) to 0.25 and delete delta(2). This allows us

to compare the number of subjects necessary per cluster if we desired a minimum effect

size of 0.25. An additional value of delta can also be added if desired. Click ok. The new

screen is in Figure 5.


Figure 5. CRT - Power vs. cluster size with J=40, δ =0.25, ρ =0.05 and 0.10

Step 4: Looking at the legend, we know the correct ρ for this example is specified.

However, click on ρ on the toolbar to see the options. Notice rho(1) is set to 0.05 and

rho(2) is set to 0.10, the default settings. By leaving rho(2) at 0.10, we are able to see

how changing the value of rho affects the necessary cluster size for a specified power. An

additional value of rho can also be added if desired. Click ok. Since we did not make any

changes, the screen stays the same.

Step 5: Looking at the legend, we know the correct α is specified. Clicking on α on the

toolbar we see α is set to 0.05.

Step 6: Recall that in Scenario 1 we are trying to determine the number of people we

need in each cluster to achieve power of 0.80 with J=40, δ =0.25, and ρ =0.05. Using the

legend in Figure 5 to find the trajectory that matches our specifications, we can click

along the correct trajectory until the power = 0.80 to determine the appropriate n. In this

case, n = 37, so 37 people are required in each cluster. However, 37 people in one

classroom might be unreasonable. Let’s see what happens when we add a covariate.


Step 7: To add the information from the covariate, click on R²L2. Now you may enter up

to three values for R²L2. Leave R²L2(1) equal to 0 but enter R²L2(2) equal to 0.49. Click

ok. The new screen is in Figure 6.

Figure 6. CRT - Power vs. cluster size with J=40, δ =0.25, ρ =0.05 and 0.10,

and R²L2 = 0 and 0.49.

By including the covariate, we can achieve power = 0.80 for δ =0.25 and ρ =0.05 with a

classroom (cluster) size of only 19, which is more realistic.

Step 8: Add another value for δ by clicking on δ on the toolbar and specify delta(3) =

0.50. Click ok. The new screen is in Figure 7.

Figure 7. CRT - Power vs. cluster size with J=40, δ =0.25, 0.50, ρ =0.05, 0.10, and

R²L2 = 0, 0.49.


As you can see, 8 trajectories appear on the screen, one for each combination of δ, ρ,

and R²L2. The key in the upper right corner defines the various trajectories. Notice the

larger desired effect sizes achieve higher power with fewer people per cluster than do the

smaller effect sizes. Intuitively, this makes sense because it is easier to detect a larger

effect size than a smaller effect size. Also, larger values of ρ decrease the power for a

specified effect size. Note that the power does not approach 1 in every case because

increasing n only increases power to a certain point.

Click X on the toolbar to close the screen and select a new option.
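The two answers in this scenario (n = 37 without the covariate, n = 19 with it) can also be reproduced numerically (an illustrative sketch, not OD code; `t_crit` is a series approximation to the 5% two-sided t critical value, and power is approximated by a shifted normal rather than the exact non-central F):

```python
import math

def phi(z):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def t_crit(df, z=1.959964):
    # series approximation to the two-sided 5% critical value of the t distribution
    return z + (z ** 3 + z) / (4 * df) + (5 * z ** 5 + 16 * z ** 3 + 3 * z) / (96 * df ** 2)

def approx_power(delta, rho, n, j, r2=0.0):
    # equation 9 non-centrality; J - 2 df without a covariate, J - 3 with one
    lam = delta ** 2 * j / (4 * ((1 - r2) * rho + (1 - rho) / n))
    df = j - 3 if r2 > 0 else j - 2
    return phi(math.sqrt(lam) - t_crit(df))

def smallest_n(delta, rho, j, r2=0.0, target=0.80):
    # smallest cluster size whose approximate power reaches the target
    return next(n for n in range(2, 1000)
                if approx_power(delta, rho, n, j, r2) >= target)

print(smallest_n(0.25, 0.05, j=40))            # 37 students per classroom
print(smallest_n(0.25, 0.05, j=40, r2=0.49))   # 19 with the covariate
```

For these inputs the approximation agrees exactly with the cluster sizes read off the OD trajectories.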

3.2.2 Scenario 2 – Unknown Number of Clusters (J)

Based on past studies, the researchers estimate ρ = 0.05 and want to be able to

detect a minimum effect size of 0.25. Assuming that 20 students are willing to participate

in the study from each classroom, how many classrooms (clusters) are necessary to

achieve power = 0.80? Find the power of the study with and without the covariate.


In Scenario 2, the number of clusters, J is unknown. As a result, we want to select

the power vs. number of clusters (J) option. This allows the number of clusters to vary

along the x-axis. Clicking on power vs. number of clusters (J) reveals a blank screen and

a toolbar that looks very similar to the toolbar for power vs. cluster size (n) in Figure 3.

The only difference is now there is an n on the toolbar instead of a J. This is because now

n is set while J is allowed to vary.

Follow the steps below to answer the questions:

Step 1: Click on power vs. number of clusters (J).

Step 2: Click on n on the toolbar. Click ok because the default is 20, which is the number

of students per cluster in this study. Figure 8 displays the screen.

Figure 8. CRT - Power vs. number of clusters with n=20, δ =0.20, 0.40, and ρ =0.05,

0.10.


Looking at the key, we know we need to change δ since we are looking for a minimum

effect size of 0.25, which is not the default setting. We can also see that we need to

delete ρ =0.10 since we are interested in ρ =0.05.

Step 3: Click on δ and change delta (1) = 0.25 and delete delta (2).

Step 4: Click on ρ and delete rho (2). Figure 9 displays the new screen.

Figure 9. CRT - Power vs. number of clusters with n=20, δ =0.25, and ρ =0.05.

We can click along the correct trajectory until the power = 0.80 to determine the

necessary J when there is no covariate. Clicking along the trajectory reveals that

approximately 50 clusters are necessary to obtain power =0.80. Notice that as the number

of clusters increases, the power approaches 1 for each trajectory.

Step 5: Click on R²L2 in order to include the covariate in the power analysis. In order to

compare the designs with and without a covariate, leave R²L2(1) equal to 0. Recall that

the covariate explained 49% of the variation in the cluster-level outcome so enter 0.49

for R²L2(2). Figure 10 displays the new screen.

Figure 10. CRT - Power vs. number of clusters with n=20, δ =0.25, ρ =0.05, and R²L2 = 0

and 0.49.

Clicking along the trajectory that includes the covariate, we can see that 40 clusters are

necessary to achieve power = 0.80. Including the covariate reduced the total number of

clusters by 10, which will help reduce the costs of the experiment.

3.2.3 Scenario 3 – Unknown intra-class correlation (rho)

The researchers have 40 classrooms in the study and 50 students per classroom.

They want to be able to detect a minimum effect size of 0.25. What value of the intra-

class correlation coefficient results in power = 0.80? Consider the case with and without

the covariate.

In Scenario 3, the intra-class correlation, ρ , is unknown. As a result, we want to

select the power vs. intra-class correlation (rho) option. This allows the intra-class

correlation to vary along the x-axis. Clicking on power vs. intra-class correlation (rho)

reveals a blank screen similar to Figure 3. However, ρ no longer appears on the toolbar

because it is the unknown quantity.


Follow the steps below to investigate the question.

Step 1: Click on power vs. intra-class correlation (rho).

Step 2: Click on δ on the toolbar and change delta(1) to 0.25. Leave delta(2) at 0.50 for

comparison purposes. Click ok. Figure 11 displays the screen that appears.

Figure 11. CRT - Power vs. ρ with n=50, J=20, and δ =0.25 and 0.50

Note that the legend reveals J=20, n=50, and α =0.05. Since the Scenario specifies J=40,

we need to change the setting for J. The settings for n and α are correct.

Step 3: Click on J on the toolbar. Change J(1) to 40 because there are 40 schools in the

example. Click ok. The new screen is in Figure 12.


Figure 12. Power vs. ρ with n=50, J=40, and δ =0.25 and 0.50

Step 4: Recall that in Scenario 3 we are trying to determine the intra-class correlation that

results in power of 0.80 with n=50, J=40, and δ =0.25. Using the legend to find the

trajectory that matches our specifications, click along the appropriate trajectory to

determine the value of the intra-class correlation that results in power = 0.80. The result

is ρ =0.055. Notice that as the intra-class correlation increases, or more of the variation is

due to between-cluster variation, the power of the test decreases, which is consistent with

the results in Chapter 1.

Step 5: Click on R²L2 in order to include the covariate in the power analysis. Leave

R²L2(1) equal to 0 but set R²L2(2) equal to 0.49. Let’s remove the extra effect size in

order to keep the screen manageable. Click on δ and delete delta (2). Figure 13 displays

the new screen.


Figure 13. Power vs. ρ with n=50, J=40, δ =0.25, and R²L2 = 0 and 0.49.

Note that clicking along the dotted trajectory reveals that by including the covariate, a

ρ equal to 0.11 will achieve power = 0.80. In other words, by including the cluster-level

covariate, we can have a larger unconditional ρ and still achieve the desired power.

3.2.4 Scenario 4 – Unknown minimum effect size

The researchers have 40 classrooms in the study and 30 students per classroom.

Based on past studies, the intra-class correlation coefficient is 0.05. What is the minimum

effect size the researchers can detect with power = 0.80? Consider the case with and

without the covariate.

In Scenario 4, the minimum effect size is unknown. As a result, we want to select

the power vs. effect size (delta), which allows the effect size to vary along the x-axis.

Clicking on power vs. effect size (delta) reveals a blank screen with a toolbar similar to

Figure 2. However, δ no longer appears on the screen because it is the unknown quantity.

Follow the steps below to answer the question.

Step 1: Click on power vs. effect size (delta).


Step 2: Click on J on the toolbar. Change J(1) to 40 since there are 40 clusters in the

study. Click ok. Figure 14 displays the screen that appears.

Figure 14. CRT - Power vs. δ with n=50, J=40, and ρ =0.05 and 0.10

Note that the legend shows that n=50, ρ =0.05, and 05.0=α so we do not need to change

any of the settings. However, we can delete ρ =0.10 since it is not required for this

design. Click on ρ and delete rho (2). Figure 15 displays the new screen.


Figure 15. CRT - Power vs. δ with n=50, J=40, and ρ =0.05.

Recall that in Scenario 4 we are trying to determine the minimum effect size that results

in power of 0.80 with n=50, J=40, and ρ =0.05. Clicking along the trajectory until the

power is 0.80 we can see that the minimum effect size the researchers can detect with

power = 0.80 and no covariate is approximately 0.24.

Step 3: Click on R²L2 in order to include the covariate in the power analysis. Leave

R²L2(1) equal to 0 but set R²L2(2) equal to 0.49. Figure 16 displays the new screen.

Figure 16. CRT - Power vs. δ with n=50, J=40, ρ =0.05 and R²L2 = 0 and 0.49.

Including the covariate in the design allows the researchers to find a minimum detectable

effect size equal to 0.19.

3.2.5 Scenario 5 – Unknown explanatory power of the cluster-level covariate

The researchers have 40 classrooms in the study and 30 students per classroom.

Based on past studies, the intra-class correlation coefficient is 0.05. They want to detect a

minimum effect size of 0.25. Under these constraints, how much of the cluster-

level variation does the covariate need to explain in order to achieve power = 0.80?

In Scenario 5, the explanatory power of the cluster-level covariate is unknown. As

a result, we want to select the power vs. proportion of explained variation by level 2

covariate (R2). Clicking on power vs. proportion of explained variation by level 2

covariate (R2) reveals a blank screen with a toolbar similar to Figure 2. However, R²L2 no

longer appears on the screen because it is the unknown quantity.

Follow the steps below to answer the question.


Step 1: Click on default settings.

Step 2: Click on n and set n=30.

Step 3: Click on J and set J=40.

Step 4: Click on δ and set delta (1) = 0.25. Delete delta (2).

Step 5: Click on ρ and delete rho (2). The final screen is in Figure 17.

Figure 17. CRT - Power vs. R²L2 with n=30, J=40, δ =0.25, and ρ =0.05.

Clicking along the trajectory reveals that if the covariate explains 13% of the variation in

the level-2 outcome, the power equals 0.80.
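The same standardized relation can be solved in reverse for the R² the covariate must attain. A minimal sketch (same assumptions as for the software's own formulas in Chapters 1 and 2: a t-based multiplier of about 2.88, and a covariate that reduces only the between-cluster variance):

```python
def required_r2(delta, n, J, rho, multiplier=2.88):
    """Solve delta = multiplier * sqrt(4*(rho*(1-R2) + (1-rho)/n)/J)
    for R2, the proportion of level-2 variance the covariate must
    explain; multiplier ~ t(0.975, J-3) + t(0.80, J-3) is assumed."""
    target = (delta / multiplier) ** 2 * J / 4  # equals rho*(1-R2) + (1-rho)/n
    return 1 - (target - (1 - rho) / n) / rho

print(round(required_r2(0.25, 30, 40, 0.05), 2))
```

This gives roughly 0.12-0.13, consistent with the 13% read from the Figure 17 trajectory; the answer is sensitive to the exact multiplier, so treat it as a check rather than a replacement for the software.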

3.3 Optimal Sample Allocation Under Budgetary Constraints

This section focuses on planning a cluster-randomized study with a fixed budget.

Throughout this section we assume sampling costs for the treatment group are the same

as those for the control group. For example, imagine that for the literacy example

described in Section 2.2, the cost of sampling each school is $500, regardless of whether

the school receives the new program or not. The cost of sampling a student within each


school is $25. Also imagine that the total budget for the study is $20,000. Knowing the

sampling costs allows the researcher to answer two questions.

1. What is the optimal n that minimizes the variance of the treatment effect

under these budgetary constraints?

2. What are the optimal n and J that maximize the power under these budgetary

constraints?

Both questions can be answered using the OD software and are discussed in Sections 3.3.1
and 3.3.2.

3.3.1. Optimal n vs. ρ to minimize variance

The optimal design software allows a researcher to determine the optimal n that

minimizes the variance of the treatment effect for various values of ρ . Figure 18

displays the screen for the Optimal n vs. rho to minimize variance option.

Figure 18. Optimal n vs ρ screen.

Below is a description of the first three buttons on the toolbar since they differ from

previous screens.

C2/C1 – specifies the cost ratio. C2 is the cost per cluster and C1 is the cost per person.


<x< – sets the range of the x-axis. The x-axis displays the possible values of ρ. By
default ρ ranges between 0.01 and 0.25.

<y< – sets the range of the y-axis. The y-axis displays the values of optimal n.

Let’s use the software to answer question 1. Recall the question asks for the optimal n
that minimizes the variance of the treatment effect when C2 = $500, C1 = $25, and the total
cost is $20,000. Follow the steps below to answer the question.

Step 1: Click on Optimal Sample Allocation under Budgetary Constraints – Equal Costs–

Optimal n vs. rho to minimize variance.

Step 2: Calculate C2/C1 = 500/25 = 20. Click on C2/C1. By default the cost ratios are set
to 5 and 20. By leaving C2/C1 (1)=5 and C2/C1 (2)=20, we can make comparisons for
different cost ratios. Click OK. Figure 19 displays the new screen.

Figure 19. Optimal n vs ρ

Note the legend identifies the two trajectories based on the cost ratio.

Step 3: Clicking along the trajectory for C2/C1 = 20 we can investigate the optimal n for

different values of ρ . For example, if ρ = 0.05, then optimal n is approximately 20.


Step 4: The software does not calculate the corresponding J based on optimal n and total
cost. However, we can do this by hand using the formulas from Chapter 1.

J = T / (C1·n + C2) = 20,000 / ((25 × 20) + 500) = 20

So to minimize the variance of the treatment effect in this situation, we need 20 people

per school and 20 schools.
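The graphical answer agrees with the closed-form allocation rule from Chapter 1: the variance-minimizing cluster size is n* = √((C2/C1)(1 − ρ)/ρ), and J then follows from the budget constraint T = J(C1·n + C2). A minimal sketch of that arithmetic:

```python
import math

def optimal_n(c1, c2, rho):
    """Variance-minimizing cluster size under equal sampling costs
    for treatment and control: n* = sqrt((C2/C1) * (1 - rho) / rho)."""
    return math.sqrt((c2 / c1) * (1 - rho) / rho)

n_star = optimal_n(25, 500, 0.05)  # ~19.5, i.e. about 20 students per school
J = 20000 / (25 * 20 + 500)        # budget / (C1*n + C2) = 20 schools
print(round(n_star, 1), J)
```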

3.3.2. Maximizing Power

The OD software can also be used to calculate the optimal n and J to maximize

the power for a study. The optimal n to maximize power will generally be close to the

optimal n needed to minimize variance. However, in settings with small J, the results will

differ somewhat.

Because this option maximizes the power, a minimum effect size must also be

specified. Recall question 2 asked for the optimal n and J to maximize the power when

the cost of sampling a cluster is $500, the cost of sampling a person is $25, and the total

cost is $20,000. To use the software, ρ , δ , and α must be specified. Assume ρ =0.05,

δ =0.30, and α =0.05. Follow the steps below to calculate the optimal n and J for the

above specifications.

Step 1: Click on Optimal Sample Allocation under Budgetary Constraints – Equal Costs–

Maximizing Power. Figure 20 displays the screen.

Figure 20. Optimal Sample Allocation – Maximize Power


Step 2: Specify the appropriate input by clicking on each box and changing the values to
match the criteria set forth in the example. Figure 21 displays the correct input.

Figure 21. Optimal Sample Allocation – Maximize Power

Step 3. Click Compute. The optimal n, J and power are now displayed. Figure 22 shows

the final screen.

Figure 22. Optimal Sample Allocation – Maximize Power

Notice that the optimal n and J are both 20, which is the same as the results we calculated
in Section 3.3.1 for ρ = 0.05. We expect this to be the same because minimizing the

variance of the treatment effect is the same as maximizing the power. The only difference

in this option is that we are also specifying a minimum effect size, which allows us to


calculate the power. If we increase the minimum effect size to 0.40, the power increases

to 0.77 but the optimal n and J remain the same because the cost ratio and ρ (which

determine the optimal sample sizes) are not influenced by the effect size.


4. Cluster Randomized Trials with Binary Outcomes

Chapters 1, 2, and 3 discuss cluster randomized trials with continuous outcomes.

In Chapters 4 and 5 we investigate power for a cluster randomized trial with a binary

outcome. Chapter 4 provides a brief conceptual background and chapter 5 describes how

to use the Optimal Design Software to design a cluster randomized trial with a binary

outcome.

4.1 General Description of the CRT with a Binary Outcome

The general design of a CRT with a binary outcome is the same as a CRT with a

continuous outcome: students nested within schools, or more generally, the level-1 units

nested within the level-2 unit. However, the outcome variable is different. For example,

the outcome for a study might be whether or not a student drops out of school or whether

or not a student drinks alcohol in high school. The variable has only two possibilities so it

is a binary outcome. Because of the structure of the data, the model for a CRT with a

binary outcome is different than the model for a CRT with a continuous outcome. Let’s

take a closer look at the model.

4.2 The Model

The model for a CRT with binary outcome can be thought of as an extension of

the generalized linear model applied to a multi-level setting. The level-1 model is

comprised of three parts: the sampling model, the link function, and the structural model.

The level-1 sampling model defines the probability that the event will occur. The

sampling model is below:

Y_ij | φ_ij ~ B(m_ij, φ_ij) (1)

for i ∈ {1, 2, …, n_j} persons per cluster and j ∈ {1, 2, …, J} clusters;

where m_ij is the number of trials for person i in cluster j; and

φ_ij is the probability of success for person i in cluster j.


The expected value and variance of Y_ij | φ_ij are:

E(Y_ij | φ_ij) = m_ij φ_ij

Var(Y_ij | φ_ij) = m_ij φ_ij (1 − φ_ij) (2)

Note that in the case of a Bernoulli trial, m_ij = 1, so the expected value of Y_ij | φ_ij reduces
to φ_ij and the variance reduces to φ_ij (1 − φ_ij). A common link function for a binary outcome

is the logit link:

η_ij = log[φ_ij / (1 − φ_ij)] (3)

where η_ij is the log odds of success.

Let’s investigate the relationship between the probability of success, the odds of

success, and the log odds of success. If the probability of success, ijφ , is 0.50, then the

odds of success are 0.5/(1-0.5)=1, and the log odds of success is log (1)=0. If the

probability of success, ijφ , is greater than 0.5, then the odds of success are greater than 1,

and the log odds of success is positive. If the probability of success, ijφ , is less than 0.5,

then the odds of success are less than 1 and the log odds of success is negative.
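The probability-odds-log-odds relationships above are easy to verify directly; a minimal sketch:

```python
import math

def odds(p):
    """Odds of success for probability p."""
    return p / (1 - p)

def log_odds(p):
    """Log odds (logit) of success for probability p."""
    return math.log(odds(p))

for p in (0.4, 0.5, 0.6):
    print(p, round(odds(p), 2), round(log_odds(p), 2))
# p = 0.5 gives odds 1 and log odds 0; probabilities above 0.5 give
# positive log odds, and probabilities below 0.5 give negative log odds.
```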

The third part of the level-1 model is the structural model:

η_ij = β_0j (4)

where β_0j is the average log odds of success in cluster j.

The level-2 model is the same as the level-2 model for a CRT with a continuous
outcome. However, the interpretation of the parameters differs because of the logit link
function:

β_0j = γ_00 + γ_01 W_j + u_0j,  u_0j ~ N(0, τ) (5)

where γ_00 is the average log odds of success across clusters;

γ_01 is the treatment effect in log odds;

W_j is ½ for treatment and −½ for control;

u_0j is the random effect associated with each cluster mean; and

τ is the between-cluster variance in log odds.

4.3 Testing the Main Effect of Treatment

The framework for testing the main effect of treatment in the case of a binary
outcome variable is very similar to the case of a continuous outcome variable. In the
model above (equation 5), the treatment effect is denoted γ_01. It is estimated by:

γ̂_01 = η̂_E − η̂_C (6)

where η̂_E is the predicted mean for the experimental group in log odds and η̂_C is the
predicted mean for the control group in log odds. The variance of the estimated treatment

effect can be approximated by:

Var(γ̂_01) = 4(τ + σ²/n)/J (7)

where

σ² = [1/(φ_E(1 − φ_E)) + 1/(φ_C(1 − φ_C))]/2. (8)

The test statistic follows a non-central Z-distribution. The non-centrality
parameter is given below:

λ = γ_01 / √Var(γ̂_01) = γ_01 / √[4(τ + σ²/n)/J] (9)

In the case of a binary outcome, we do not typically standardize the model
because the level-1 variance is heteroskedastic, which makes the meaning of the intra-
class correlation, ρ, uninformative.


4.4 Example

Suppose a team of researchers wants to determine whether a new drug prevention
program for middle school students reduces the probability that a student does drugs. The
outcome variable is whether or not the student does any drugs prior to entering high
school. Assume that the school mean probability that a student tries any type of drug
before high school is 0.4, and that the school means vary such that the lower bound is 0.1
and the upper bound is 0.6. The researchers expect that the school mean probability that a
student who participates in the program will try drugs is 0.25. Thus far the researchers
have recruited 40 total schools, 20 that will implement the new drug prevention program
and 20 that will continue with their current policies. Within each school, they have
recruited 100 students. What power do they have to detect the desired treatment effect?
Figure 1 displays the graph from the OD Software.

Figure 1. Power vs. cluster size (α = 0.05, φ_E = 0.25, φ_C = 0.40, lower plausible value = 0.10,
upper plausible value = 0.60, J = 40).

From Figure 1, we can see that with a cluster size of 100, the power to detect the
treatment effect is 0.88.
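The 0.88 readout can be reproduced approximately from equations (6)-(9), under the assumption that the plausible bounds [0.1, 0.6] are converted into τ by treating them as a 95% interval on the log-odds scale; that conversion is our inference about the software's input, not something stated above:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def crt_binary_power(phi_e, phi_c, low, high, n, J):
    """Approximate two-sided power (alpha = 0.05) for the main effect in a
    binary-outcome CRT, using the normal approximation to the test."""
    tau = ((logit(high) - logit(low)) / (2 * 1.96)) ** 2   # assumed PI -> tau
    sigma2 = (1 / (phi_e * (1 - phi_e)) + 1 / (phi_c * (1 - phi_c))) / 2  # eq (8)
    lam = abs(logit(phi_e) - logit(phi_c)) / math.sqrt(4 * (tau + sigma2 / n) / J)  # eq (9)
    return normal_cdf(lam - 1.96) + normal_cdf(-lam - 1.96)

print(round(crt_binary_power(0.25, 0.40, 0.10, 0.60, n=100, J=40), 2))  # -> 0.88
```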


5. Using the Optimal Design Software for Cluster Randomized Trials with Binary

Outcomes

This chapter focuses on how to use the OD software to design a cluster

randomized trial with a binary outcome. Section 5.1 provides general information for

how to use the CRT with a binary outcome module. Section 5.2 provides an example

and details regarding the options within the module.

5.1 General Information

The CRT with a binary outcome option allows the researcher to calculate the

power for the average treatment effect as a function of the cluster size (n), the number of

clusters (J), and the probability of success in the experimental condition (φ_E). The menu

is in the Cluster Randomized Trial module and is given below:

Power for the main effect of treatment (binary outcome)

Power vs. cluster size (n)

Power vs. number of clusters per site (J)

Power vs. probability of success in the experimental condition (phi(E))

5.2.1 Example

A team of researchers is investigating the effects of a new “Stay in School

Campaign.” They believe that students that participate in the program are more likely to

graduate from high school than students who do not participate in the program. The

program targets 12th grade students. The program is implemented at the school level thus

we have a nested data structure of students within schools. The outcome for the study is

whether or not a student graduates from high school in 4 years. Based on past data, the

researchers expect the probability that a student graduates from high school in 4 years to

be 0.6. The researchers are unsure how to plan the cluster randomized trial. Three

scenarios they might encounter are described in the remainder of the chapter.


5.2.2 Scenario 1

The researchers anticipate the probability that a student graduates to be 0.75 in

schools that adopt the new “Stay in School Campaign.” They have a total of 40 schools,

20 in the experimental group and 20 in the control group. How many students do they

need from each school to detect their desired treatment effect? Assume the bounds for the

school mean proportion graduating in the control schools are [0.5, 0.9].

In Scenario 1, the cluster size is unknown thus we select the power vs. cluster size

(n) option. Figure 1 displays the screen.

Figure 1. CRT with binary outcome screen.

The buttons in the toolbar are explained below.

α - specifies the significance level, or chance of a Type I error. By default, α is set at

0.05, which is a common level for most designs.

J – specifies the number of clusters. By default, J is set at 20.

φ_E – specifies the probability of success in the treatment condition. By default, φ_E is set at
0.60.


φ_C – specifies the probability of success in the control condition. By default, φ_C is set at
0.40.

PI – specifies the 95% plausible interval for φ_Cj. This is the range the researcher would
expect for the school mean probability of success for the schools in the control group.

The remaining options in the toolbar are the same as those in the CRT option for

continuous outcomes. The details can be found in Chapter 3.

Follow the steps below to answer the questions:

Step 1: Click on Power vs. cluster size (n).

Step 2: Click on J. Set J = 40.

Step 3: Click on φ_E. Set φ_E = 0.75.

Step 4: Click on φ_C. Set φ_C = 0.60.

Step 5: Click on PI. Set the lower bound = 0.50 and the upper bound = 0.90. Note that we
also set the x-axis from 1 to 150. The resulting graph appears in Figure 2.

Figure 2. Power vs. cluster size.


From the figure, we can see that the researchers need approximately 16 students per

cluster to achieve power = 0.80.

5.2.3 Scenario 2

The researchers anticipate the probability that a student graduates to be 0.75 in

schools that adopt the new “Stay in School Campaign.” They have a total of 50 students

per school. How many schools do they need to detect their desired

treatment effect? Assume the bounds for the school mean proportion graduating in the

control schools are [0.5, 0.9].

In Scenario 2, the number of clusters is unknown thus we select the power vs.

number of clusters (J) option.

Follow the steps below to answer the questions:

Step 1: Click on Power vs. number of clusters (J).

Step 2: Click on n. Set n = 50.

Step 3: Click on φ_E. Set φ_E = 0.75.

Step 4: Click on φ_C. Set φ_C = 0.60.

Step 5: Click on PI. Set the lower bound = 0.50 and the upper bound = 0.90. The

resulting graph appears in Figure 3.


Figure 3. Power vs. number of clusters.

Clicking along the trajectory reveals that 28 clusters are required to achieve power =

0.80. This would mean there would be 14 clusters assigned to treatment and 14 clusters

assigned to control.
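A rough hand check of this answer is possible with the normal approximation to the binary-outcome power formulas of Chapter 4, assuming the plausible interval [0.5, 0.9] is converted to the between-school log-odds variance τ as a 95% interval (that conversion is our inference):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power_given_j(J, phi_e=0.75, phi_c=0.60, low=0.5, high=0.9, n=50):
    """Approximate two-sided power (alpha = 0.05) as a function of the
    total number of clusters J for the Scenario 2 inputs."""
    tau = ((logit(high) - logit(low)) / (2 * 1.96)) ** 2  # assumed PI -> tau
    sigma2 = (1 / (phi_e * (1 - phi_e)) + 1 / (phi_c * (1 - phi_c))) / 2
    lam = abs(logit(phi_e) - logit(phi_c)) / math.sqrt(4 * (tau + sigma2 / n) / J)
    return normal_cdf(lam - 1.96) + normal_cdf(-lam - 1.96)

print(round(power_given_j(28), 2))  # slightly above 0.80
```

The normal approximation crosses 0.80 a cluster or two before the software's answer of 28, which uses the exact test distribution, so treat the J reported by OD as authoritative.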

5.2.4 Scenario 3

The researchers expect to secure 40 total schools and 50 students per school.

What is the smallest probability of success the researcher can detect for the experimental

group with power = 0.80? Assume the bounds for the school mean proportion graduating

in the control schools are [0.5, 0.9].

In Scenario 3, the probability of success in the treatment group is unknown thus

we select the power vs. probability of success in the treatment group (phi(E)) option.

Follow the steps below to answer the questions:

Step 1: Click on Power vs. probability of success in the treatment group (phi(E)).


Step 2: Click on n. Set n = 50.

Step 3: Click on J. Set J = 40.

Step 4: Click on φ_C. Set φ_C = 0.60.

Step 5: Click on PI. Set the lower bound = 0.50 and the upper bound = 0.90. The

resulting graph appears in Figure 4.

Figure 4. Power vs. probability of success in the treatment group.

Note that this plot looks different from previous graphs. There is high power for

probabilities that are very different from 0.60, the probability of graduation for the

control group. In other words, big differences in probabilities for treatment and control

are easier to detect. The low power at 0.60 is also logical because if the probability of

graduation was very similar in both groups, the power to detect the difference would be

very low. In our example, we expect to increase the probability of graduation for students

in the treatment group. Thus we look to the right of 0.60 (the probability of graduation for

the control group). Clicking along the trajectory to the right of 0.60, we can see that with

power = 0.80, the researchers can detect a probability of graduation for the treatment


group equal to 0.72. In other words, if the researchers believe that the new program will

increase the probability of graduation by at least 0.12, then they have power = 0.80 for

the study.


6. Multi-site Cluster Randomized Trials

As discussed in Chapters 1, 2, and 3, researchers designing cluster randomized

trials are often limited in the number of clusters they can afford, resulting in studies that

lack statistical power. In Chapter 2, we investigated the effects of including a cluster-

level covariate in the design and analysis of a cluster randomized trial on the power of the

test. Chapters 6 and 7 explore another method that is commonly used to increase

statistical power in cluster randomized trials, known as blocking. In addition, we

investigate the effects of blocking and including a cluster-level covariate on the power of

the test. Chapter 6 provides a brief conceptual background, and chapter 7 describes how

to use the Optimal Design software to design a cluster randomized trial using blocking.

6.1 Why block?

Blocking is a commonly used technique in experimental design and is frequently

used in individual randomized trials. In this chapter, we extend the idea of blocking to a

cluster randomized trial and focus on the use of pre-randomization blocking to improve

the precision of the estimates and increase the power of the tests. The basic idea of pre-

randomization blocking is to find sites or blocks where clusters within the sites are very

similar with respect to the outcome variable. This reduces the heterogeneity within sites

or “blocks”, increasing the precision of the treatment effect estimate, hence increasing the

power of the test for the main effect of treatment.

To illustrate, imagine that researchers develop a new reading program for

elementary school students. We know that prior school mean test scores, ethnic

composition, and socioeconomic status are related to school mean reading achievement.

We might therefore assign the schools to “blocks” that are similar on mean prior test

scores, ethnic compositions, and mean socioeconomic status. Within each block, we

randomize schools to receive the new reading program or the regular program. This

reduces the variance in the estimate of the treatment effect because by dividing schools

into blocks we are able to remove the between-block variance from the error variance.


The between-block component is likely to be large, so removing it greatly reduces the

variance of the estimate.

A design using blocking before randomizing groups can be thought of as a multi-

site cluster randomized trial, an extension of the cluster randomized trial. In a multi-site

cluster randomized trial, the site is the block and clusters are randomly assigned to

treatment and control within each site. Sometimes the sites are natural administrative

units, for example, schools where classrooms are randomly assigned to treatment within

schools. The remainder of this chapter and Chapter 7 will refer to a design that utilizes

blocking as a multi-site cluster randomized trial.

Designing a multi-site cluster randomized trial requires that the researcher
calculate the power for the average treatment effect. The power to detect the main effect
of treatment in a multi-site cluster randomized trial is slightly more complicated than in a
cluster randomized trial. In a typical multi-site cluster randomized trial, power is a
function of the minimum detectable effect size, δ, the intra-class correlation, ρ, the
effect size variability, σ²_δ, the number of sites, K, the number of clusters per site, J, and
the cluster size, n, while holding α constant. The effect size variability, σ²_δ, is not
estimable in a cluster randomized trial because we calculate only one effect size, the
difference between the control and experimental groups. However, in a multi-site cluster
randomized trial, the experiment is replicated within sites. This allows us to estimate an
effect size for each site. Thus we are able to estimate the variance of the effect size. For
reasons discussed later in this chapter, it can also be useful to calculate the power for the
variance of the treatment effect. Power for the treatment effect variability is a function of
the intra-class correlation, ρ, the effect size variability, σ²_δ, the number of sites, K, the
number of clusters, J, and the cluster size, n, while holding α constant. Note that δ, the
standardized main effect of treatment, is not a part of this power calculation. Both tests
are discussed separately following a discussion of the model for a multi-site cluster
randomized trial.


In many cases, the sites will be regarded as randomly sampled from a larger

universe or “population” of possible sites. The larger universe is the target of

generalization. For example, if schools are sampled and then classrooms are assigned at

random to treatments within schools, the target of any generalizations will often be the

larger universe of schools from which schools in the study are regarded as a

representative sample.

In other cases, the sites will be regarded as fixed. Consider a program designed to

teach students about the dangers of drugs. The outcome for the study is students’ attitude

towards drugs, which is measured by a questionnaire. The researchers hypothesize that

the school setting - suburban, urban, or rural - affects students’ attitude towards drugs.

Thus they want to block on the setting. In this case, suburban, urban, and rural are not

regarded as sampled from a population of settings, but rather as fixed blocks or sites.

Whether we view sites as fixed or random affects the data analysis and planning

for adequate power to detect the treatment effect. Sections 6.2-6.6 explain how to plan

studies in which sites are regarded as random. Section 6.7 describes how to modify these

procedures for the case in which sites are regarded as fixed.

6.2 The Random Effects Model

We can represent data from a multi-site cluster randomized trial as a three-level
model, persons nested within clusters nested within sites. The level-1 model, or person-
level model, is:

Y_ijk = π_0jk + e_ijk,  e_ijk ~ N(0, σ²) (1)

for i ∈ {1, 2, …, n} persons per cluster, j ∈ {1, 2, …, J} clusters, and k ∈ {1, 2, …, K} sites,

where π_0jk is the mean for cluster j in site k;

e_ijk is the error associated with each person; and

σ² is the within-cluster variance.


The level-2 model, or cluster-level model, is:

π_0jk = β_00k + β_01k W_jk + r_0jk,  r_0jk ~ N(0, τ_π) (2)

where β_00k is the mean for site k;

β_01k is the treatment effect at site k;

W_jk is a treatment contrast indicator, ½ for treatment and −½ for control;

r_0jk is the random effect associated with each cluster; and

τ_π is the variance between clusters within sites.

The level-3 model, or site-level model, is:

β_00k = γ_000 + u_00k,  var(u_00k) = τ_β00

β_01k = γ_010 + u_01k,  var(u_01k) = τ_β11

cov(u_00k, u_01k) = τ_β01 (3)

where γ_000 is the grand mean;

γ_010 is the average treatment effect (“main effect of treatment”);

u_00k is the random effect associated with each site mean;

u_01k is the random effect associated with each site treatment effect;

τ_β00 is the variance between site means;

τ_β11 is the variance between sites on the treatment effect; and

τ_β01 is the covariance between site-specific means and site-specific treatment
effects.

The random effects u_00k and u_01k are typically assumed bivariate normal in
distribution. We are interested in two quantities, the main effect of treatment, γ_010, and
the variance of the treatment effect, τ_β11. Note that we are operating under a random
effects model. In a fixed effects model, the variance of the treatment effect, τ_β11, would be
0. Section 6.3 focuses on the power for the main effect of treatment for a random effects
model. Section 6.6 discusses power for the treatment effect variability.


6.3 Testing the Average Treatment Effect

The average treatment effect is denoted as γ_010 in level 3 of the model. Given a
balanced design, it is estimated by

γ̂_010 = Ȳ_E − Ȳ_C (4)

where Ȳ_E is the mean for the experimental group and Ȳ_C is the mean for the control
group.

Note that the estimated main effect of treatment looks like that in the cluster
randomized trial except that now we are summing over clusters and sites. Thus the
variance of the treatment effect is slightly different than in a cluster randomized trial. It is
estimated by (Raudenbush and Liu 2000)

Var(γ̂_010) = [τ_β11 + 4(τ_π + σ²/n)/J]/K. (5)

The main difference between the variance of the treatment effect in a multi-site cluster
randomized trial and that in a cluster randomized trial is that we now have four sources of
variability: the within-cluster variance, σ², the between-cluster or within-site
variance, τ_π, the between-site variance, τ_β00, and the between-site variance in the
treatment effect, τ_β11.

If the data are balanced, we can use the results of a nested analysis of variance
with random effects for the clusters and sites and fixed effects for the treatment. Similar
to prior tests, the test statistic is an F statistic. The F test follows a non-central F
distribution, F(1, K − 1; λ). Recall that the noncentrality parameter is a ratio of the
squared treatment effect to the variance of the treatment effect estimate. Below is the
noncentrality parameter for the test.

λ = γ²_010 / Var(γ̂_010) = K γ²_010 / [τ_β11 + 4(τ_π + σ²/n)/J]. (6)


Recall that the larger the non-centrality parameter, the greater the power of the

test. By looking at the formula, we can see that K, the number of sites, has the greatest

impact on the power. It is especially important to have a large K if there is a lot of

between-site variance. Increasing J also increases the power but is not as important as K.

J becomes more important if there is a lot of variability between clusters. Finally,

increasing n does increase the power, but has the smallest effect of the three sample sizes.

Increasing n is most beneficial if there is a lot of variability within clusters. In addition to

K, J, and n, a larger effect size increases power. Note that τ_β11, the between-site variance

of the treatment effect, appears in the denominator of the non-centrality parameter. As

mentioned above, if the variance of the treatment effect across sites is large, it is

particularly important to have a large number of sites to counteract the increase in

variance in order to achieve adequate power. However, if the variability of the impact

across sites is very large, the average treatment effect may not be informative. Section 6.6

discusses the importance of the variance of the treatment effect.

Thus far, we have focused on the unstandardized random effects model for a

multi-site cluster randomized trial. However, we know that researchers often talk in terms

of standardized effect sizes and standardized effect size variability. Recall in chapter 1,

we adopted Cohen’s definition for standardized effect sizes, with 0.20, 0.50, and 0.80 as

small, medium, and large effect sizes. We continue with these same rules of thumb in this

chapter. In a multi-site cluster randomized trial, we also need to standardize the variance

of the effect size. The magnitude of the effect size variability depends on the desired

minimum detectable effect. For example, an effect size variance of 0.10 is the same as a

standard error of approximately √0.10 ≈ 0.31. If a researcher desires a minimum

detectable effect of 0.20, a standard error of 0.31 is too large and would indicate a lot of

uncertainty in the estimate. For an effect size of 0.20, an effect size variance of 0.01 (or

standard error of 0.10) is more reasonable. Let’s see how we translate the unstandardized

model into a standardized model.


In the standardized model, the within-cluster variance, σ², and the between-
cluster variance, τ_π, sum to 1. The intra-cluster correlation, ρ, is defined as
ρ = τ_π/(τ_π + σ²). Since τ_π + σ² = 1, we can rewrite τ_π = ρ and σ² = 1 − ρ. This notation is
the same as the notation for the standardized model for the cluster randomized trial. It is
important to recognize that in the multi-site cluster randomized trial we remove the
between-block variability, so ρ is the between-cluster variance relative to the total
variance within blocks. However, in the program we ask the user to specify the intra-class
correlation, standardized effect size, and effect size variability prior to blocking, as well as
the percentage of variance explained by blocking, in order to simplify the calculations for
the user.

In standardized notation, the non-centrality parameter, λ, can be rewritten as:

λ = K δ² / {σ²_δ + 4[ρ + (1 − ρ)/n]/J} (7)

where the intra-cluster correlation, ρ, is:

ρ = τ_π/(τ_π + σ²),

or the variance between clusters relative to the between- and within-cluster variation
within blocks; δ is the standardized main effect of treatment,

δ = γ_010/√(τ_π + σ²);

and σ²_δ is the variance of the standardized treatment effect,

σ²_δ = τ_β11/(τ_π + σ²).

It is important to be familiar with the standardized model because it is common
among researchers to use the standardized values, and the Optimal Design software
operates with the standardized notation, which requires that the researcher be able to
identify ρ, δ, and σ²_δ. Note that power now depends on n, J, K, ρ, δ, and σ²_δ.
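The relative leverage of K, J, and n discussed above can be seen numerically from the standardized non-centrality parameter λ = Kδ²/(σ²_δ + 4[ρ + (1 − ρ)/n]/J). A minimal sketch (the input values are illustrative, not from any example in this chapter):

```python
def ncp(delta, rho, sigma2_delta, n, J, K):
    """Standardized noncentrality parameter for the main effect of
    treatment in a multi-site cluster randomized trial."""
    return K * delta**2 / (sigma2_delta + 4 * (rho + (1 - rho) / n) / J)

base = ncp(0.25, 0.05, 0.01, n=20, J=4, K=20)
more_sites = ncp(0.25, 0.05, 0.01, n=20, J=4, K=40)       # doubled K
bigger_clusters = ncp(0.25, 0.05, 0.01, n=40, J=4, K=20)  # doubled n
print(round(base, 1), round(more_sites, 1), round(bigger_clusters, 1))
# Doubling K doubles lambda; doubling n raises it much less.
```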


6.4 Including a Cluster-level Covariate

In addition to blocking, researchers may also have cluster-level covariates
available. The cluster-level covariate in a multi-site randomized trial functions similarly
to the cluster-level covariate in a cluster randomized trial discussed in Chapter 2. Recall
that including a cluster-level covariate influences the power of the test depending on the
strength of the correlation between the covariate and the true cluster mean outcome, or
how much of the variability in the true cluster mean outcome is explained by the
covariate. The proportion of explained variability is denoted R². The larger R², the
smaller the conditional level-2 variance, τ_π|x, relative to the unconditional level-2
variance, τ_π, and the greater the benefit of the covariate in increasing precision and
power.

6.5 The Models and Treatment Effect Estimates

The level-1 model for a multi-site cluster randomized trial with a cluster-level
covariate looks the same as the level-1 model for a regular multi-site cluster randomized
trial (see equation 1). The level-2 model looks different because it includes the cluster-
level covariate. It is written as:

π_0jk = β_00k + β_01k W_jk + β_02k X_jk + r_0jk,  r_0jk ~ N(0, τ_π|x) (8)

Note: τ_π|x = (1 − R²)τ_π

where β_00k is the adjusted mean for site k;

β_01k is the adjusted treatment effect at site k;

β_02k is the regression coefficient for the cluster-level covariate at site k;

W_jk is 0.5 for treatment and −0.5 for control;

X_jk is the cluster-level covariate, typically centered to have mean 0;

r_0jk is the random effect associated with each cluster; and

τ_π|x is the residual variance conditional on the cluster-level covariate X_jk.

74

Note that the between cluster variance is now a residual variance conditional on the

cluster-level covariate . jkX

The level-3 model is now:

β00k = γ000 + u00k,   u00k ~ N(0, τβ00|x)   (9)
β01k = γ010 + u01k,   u01k ~ N(0, τβ11)
β02k = γ020

where γ000 is the grand mean;

γ010 is the average treatment effect (“main effect of treatment”);

γ020 is the regression coefficient for the cluster-level covariate, which is assumed constant across sites;

u00k is the random effect associated with each site mean;

u01k is the random effect associated with each site treatment effect;

τβ00|x is the residual variance between site means; and

τβ11 is the variance between sites on the treatment effect.

Because of the randomization, the true treatment effect is not influenced by the covariate. Thus it is not necessary to have a conditional variance for the between-site variation in the treatment effect. Note that we are also fixing the average regression coefficient for the cluster-level covariate.

The estimate of the main effect of the treatment accounting for the cluster-level covariate is:

γ̂010 = (ȲE − ȲC) − γ̂020 (X̄E − X̄C).   (10)

In words, it is the mean difference adjusted for the treatment-group differences on the covariate. To test the main effect of treatment we use an F statistic which follows a non-central F distribution, F(1, K − 2; λx), where:

λx = K γ010² / [τβ11 + 4(τπ|x + σ²/n)/J].   (11)

This formula for the noncentrality parameter looks similar to the noncentrality parameter without the covariate except that the estimate of the treatment effect is calculated differently and the between-cluster variance is now a conditional variance.

Following the same logic as the multi-site cluster randomized trial with no covariate, it is important to standardize the model. The non-centrality parameter expressed in standardized notation is:

λx = K δ*² / {σ²δ* + 4[ρ* + (1 − ρ*)/n]/J}   (12)

where the intra-cluster correlation, ρ*, is

ρ* = τπ|x / (τπ|x + σ²),

or the conditional variance between clusters relative to the between- and within-cluster variation within blocks; δ* is the standardized main effect of treatment conditional on the covariate,

δ* = γ010 / √(τπ|x + σ²);

and σ²δ* is the variance of the standardized treatment effect conditional on the covariate,

σ²δ* = τβ11 / (τπ|x + σ²).

Because the conditional standardized quantities resulting from inclusion of a covariate are frequently unknown, the program asks the user to enter the unconditional parameters, ρ, δ, and σ²δ. The program calculates the conditional standardized values based on the input.
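As a rough cross-check of the values the software reports, the standardized calculation above can be sketched with scipy. This is a simplified sketch, not the Optimal Design implementation: it assumes the unconditional standardized scale τπ + σ² = 1 (so τπ|x = (1 − R²)ρ and σ² = 1 − ρ), it ignores the blocking parameter B, and the function name `mscrt_power` is our own.

```python
from scipy.stats import f, ncf

def mscrt_power(n, J, K, rho, delta, sigma2_delta, R2=0.0, alpha=0.05):
    """Approximate power for the main effect of treatment in a multi-site
    cluster randomized trial (random site effects), using equation 12.

    Inputs are the UNCONDITIONAL standardized parameters; the conditional
    quantities are derived internally. Assumes tau_pi + sigma^2 = 1, so
    tau_pi|x = (1 - R2) * rho and sigma^2 = 1 - rho; blocking is ignored.
    """
    tau_cond = (1.0 - R2) * rho            # conditional between-cluster variance
    sigma2 = 1.0 - rho                     # within-cluster variance
    total_cond = tau_cond + sigma2         # conditional total variance
    rho_star = tau_cond / total_cond       # conditional intra-cluster correlation
    delta_star = delta / total_cond ** 0.5 # conditional standardized effect size
    s2d_star = sigma2_delta / total_cond   # conditional effect-size variability
    # Equation 12: noncentrality parameter
    lam = K * delta_star ** 2 / (s2d_star + 4.0 * (rho_star + (1.0 - rho_star) / n) / J)
    df1, df2 = 1, K - 2                    # F(1, K - 2; lambda_x), as in the text
    fcrit = f.ppf(1.0 - alpha, df1, df2)
    return 1.0 - ncf.cdf(fcrit, df1, df2, lam)
```

For example, `mscrt_power(50, 10, 10, 0.15, 0.20, 0.01, R2=0.49)` returns a higher power than the same call with `R2=0.0`, mirroring the benefit of the covariate described above; because the sketch ignores blocking, its numbers will be somewhat lower than the software's when B > 0.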

6.6 Testing the Variance of the Treatment Effect

Recall that in a fixed effects model we assume the treatment effect to be

homogeneous across the sites. Thus the tests described in this section are only applicable


under the random effects model where we assume the treatment effect differs across the

sites. To quantify this difference, we estimate the variance of the treatment effect across

the sites. The design, with treatments randomized to clusters within sites, allows us to

estimate this variability. If it is very large, it may be hiding the true treatment effect. For

example, imagine a multi-site cluster randomized trial that reports a standardized

treatment effect estimate of 0.23. The researchers claim that the new reading program

improves scores by 0.23 standard units. However, they fail to report that the standardized

treatment effect variability across sites is 0.30. The high variance may be hiding the fact

that some types of schools benefit from the program while other types of schools actually

suffer from the program. For example, there may be a differential effect by location,

where rural schools that adopt the program see positive effects but urban schools that

adopt the program see negative effects. Thus the researchers would need to investigate

moderating site characteristics. Reporting the average treatment effect alone may be very

misleading and is not recommended.

Because the variance of the treatment effect is critical in determining the

interpretation of a treatment effect estimate, it is important to be able to detect the

treatment effect variability with adequate power. The remainder of this section describes

how to calculate the power for the variance of the treatment effect. We will use the

standardized model notation since it is more common in practice and is required for the

Optimal Design software.

The null and alternative hypotheses for the treatment effect variability are:

H0: σ²δ = 0
H1: σ²δ > 0.

The null hypothesis states that the variance of the treatment impact across sites is zero, whereas the alternative hypothesis states that it is greater than 0. The test for the variance of the treatment effect is an F test. The F statistic is:

F = [σ̂²δ + 4(ρ̂ + (1 − ρ̂)/n)/J] / [4(ρ̂ + (1 − ρ̂)/n)/J].   (13)

Note that the average effect size is not a part of the calculation; thus the power is based on the number of sites, K, the number of clusters per site, J, the number of people per cluster, n, the standardized effect size variability, σ²δ, and the intra-cluster correlation, ρ.

The F statistic follows a central F distribution with df = K − 1, K(J − 2). The ratio of the expectation of the numerator to the expectation of the denominator is

ω = 1 + J σ²δ / {4[ρ + (1 − ρ)/n]}.   (14)

Under the null hypothesis, we expect σ²δ to be 0, thus ω = 1. As σ²δ increases, ω gets larger, which increases the power of the test. Thus the number of clusters within each site is critical for increasing the power to detect the variance of the treatment effect across sites. As the number of clusters within each site increases, so does the power to detect the variability of treatment effects. Increasing K also increases the power, through the degrees of freedom, but is not as important as increasing J. Note that this is the opposite of what we found in the case of power for the treatment effect, where K is the most significant factor in increasing power and J is less important.

Looking at equation 14, we can see that it will be difficult to achieve adequate power to detect small values of σ²δ, like 0.01, unless J is extremely large, which is unlikely. This is not very problematic, however, because our primary concern is to be able to detect larger treatment effect variability, since small values will not influence the interpretation of the treatment effect.
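To see how ω translates into power, here is a minimal sketch (our own helper, not the Optimal Design code) that applies the standard balanced-ANOVA result that, under the alternative, the statistic in equation 13 behaves as ω times a central F variate with K − 1 and K(J − 2) degrees of freedom:

```python
from scipy.stats import f

def variance_test_power(n, J, K, rho, sigma2_delta, alpha=0.05):
    """Approximate power to detect treatment-effect variability across sites.

    Assumes the test statistic is omega times a central F with
    df = (K - 1, K*(J - 2)), where omega is the expectation ratio in
    equation 14. Blocking is ignored in this sketch.
    """
    omega = 1.0 + J * sigma2_delta / (4.0 * (rho + (1.0 - rho) / n))  # eq. 14
    df1, df2 = K - 1, K * (J - 2)
    fcrit = f.ppf(1.0 - alpha, df1, df2)
    # Reject when F > fcrit, i.e. when the central F ratio exceeds fcrit/omega
    return f.sf(fcrit / omega, df1, df2)
```

Consistent with the discussion above, the power grows quickly with σ²δ and with J, since both enter ω directly, while K helps only through the degrees of freedom.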

6.7.1 Example 1

To illustrate the use of blocking, consider the two examples below. Researchers have developed a new drug prevention program and are eager to test the effectiveness of the program. The researchers decide to block on location. They choose 10 large metropolitan cities across the United States from a population of large cities, and within each city randomly assign schools to either the new drug prevention program or the regular program. Assume 10 schools participate within each site and 50 students within each school. Assume that the intra-class correlation before blocking is ρ = 0.15 and the effect size variability is σ²δ = 0.01. Blocking accounts for 50% of the variation in the outcome variable. The researchers want to be able to detect a minimum treatment effect of 0.20. What power do the researchers have to detect this treatment effect? Assume that the researchers also have access to a cluster-level covariate that explains 49% of the variability in cluster-mean outcomes after accounting for blocking. How does this change the power to detect the treatment effect? Ignoring the covariate, what power do the researchers have to detect the variance of the treatment effect?

Below is a plot from the Optimal Design software with the specifications listed in

the example with no cluster-level covariate. Note that the number of sites is allowed to vary along the x-axis, with the power plotted as a function of the number of sites.

Figure 1. Power vs. Number of Sites (Power for Treatment Effect). [Plot legend: α = 0.050, n = 50, J = 10; δ = 0.20, ρ = 0.15, σ²δ = 0.010, B = 0.50.]

First, note that the legend in the upper right corner matches the specifications set forth in the example. Clicking along the trajectory at K = 10 sites reveals that the power is

approximately 0.75. Let’s see what happens when we include a cluster-level covariate

that explains 49% of the variation in the cluster-level outcome. Figure 2 displays the

trajectory with and without the cluster-level covariate.

Figure 2. Power vs. Number of Sites (Power for Treatment Effect). [Plot legend: α = 0.050, n = 50, J = 10; solid: δ = 0.20, ρ = 0.15, σ²δ = 0.010, B = 0.50; dotted: δ = 0.20, ρ = 0.15, σ²δ = 0.010, B = 0.50, R²L2 = 0.49.]

Clicking along the dotted trajectory at K = 10 reveals that the power is approximately

0.87. Thus we can conclude that including the cluster-level covariate increases the power

to an adequate level to detect the main effect of treatment.

We can also plot the power for the variance of the treatment effect. The plot does

not include the cluster-level covariate. Figure 3 displays the plot for the variance of the

treatment effect.

Figure 3. Power vs. Number of Sites (Power for Effect Size Variability). [Plot legend: α = 0.050, n = 50, J = 10; solid: ρ = 0.15, σ²δ = 0.010, B = 0.50; second trajectory: ρ = 0.15, σ²δ = 0.100, B = 0.50.]

Clicking along the solid trajectory reveals that for 10 sites, the power to detect the

treatment effect variability is 0.14. Note that the very small effect size variability, 0.01,

makes it difficult to detect. The second trajectory on the graph sets the effect size

variability equal to 0.10. The remaining constraints are the same but the power to detect

the treatment effect variability is 0.83. In other words, as the size of the treatment effect variability

increases, the power to detect it increases dramatically.

6.7.2 Example 2

The second example compares a cluster randomized trial to a multi-site cluster

randomized trial with respect to the power to detect the treatment effect. Imagine a team

of researchers develops a new math program for 4th graders. They propose that students

who participate in the new program will have increased math achievement. The outcome

is math score on a specific math test at the completion of 4th grade. They propose two

designs and want to know which design will give them the most power to detect the

treatment effect.

Design 1 – Cluster Randomized Trial: The first design is a cluster randomized

trial. They plan to randomly assign 40 schools to either treatment or control. Within each

school, they plan to test 100 students. Based on past research, the researchers estimate an


intra-class correlation of 0.10. They want to be able to detect a minimum effect size of

0.25. What is the power of the test under this design?

Design 2 – Multi-Site Cluster Randomized Trial: The second design is a multi-site

cluster randomized trial. Based on past studies, the researchers know that the percent of

children in a school on free and reduced lunch is strongly related to achievement. The

researchers obtain 10 sites, blocked on the percent of students on free and reduced lunch.

Within each site they randomly assign 2 schools to treatment and 2 schools to control.

They still test 100 students within each school. Research indicates that blocking on

percent of children on free and reduced lunch reduces the between-school variation by

64%. Assuming the variability in effect sizes is small, 0.01, and the intra-class correlation is ρ = 0.10, what is the power to detect a treatment effect of 0.25?

Design 1: Plugging the information into the Cluster Randomized Trial option, we

get the graph in Figure 4.

Figure 4. Power vs. number of clusters. [Plot legend: α = 0.050, n = 100; δ = 0.25, ρ = 0.10.]

Clicking on the trajectory at J=40 reveals that the power to detect an effect is 0.64. This

is not a very powerful design so let’s see how design 2 compares.


Design 2. Plugging the information into the Multi-Site Cluster Randomized Trial

option, we get the plot in Figure 5.

Figure 5. Power vs. number of sites (Power for Treatment Effect). [Plot legend: α = 0.050, n = 100, J = 4; δ = 0.25, ρ = 0.10, σ²δ = 0.010, B = 0.64.]

We can see that the larger the number of sites, the greater the power. Clicking along the

trajectory we can see that for 10 sites, the power is approximately 0.86. Note that 10 sites

with 4 schools at each site results in a total of 40 schools. Thus we use 40 schools in both

designs but we are able to increase the power from 0.64 to 0.86 by blocking on percent of

students on free and reduced lunch.

6.8 The Fixed Effects Model

Recall that we can represent data from a multi-site cluster randomized trial as a three-level model, with persons nested within clusters nested within sites. The fixed effects model is identical to the random effects model with a crucial exception: the site-specific contributions u00k and u01k are designated as fixed constants rather than random variables.

The level-1 and level-2 models are identical to models (1) and (2) in the random effects case. The level-3 model, or site-level model, is:

β00k = γ000 + u00k
β01k = γ010 + u01k   (15)

where γ000 is the grand mean;

γ010 is the average treatment effect (“main effect of treatment”);

u00k, for k ∈ {1, 2, ..., K}, are fixed effects associated with each site mean, constrained to have a mean of zero; and

u01k, for k ∈ {1, 2, ..., K}, are fixed effects associated with each site treatment effect, constrained to have a mean of zero.

We are interested in two kinds of quantities: the main effect of treatment, γ010, and the fixed treatment-by-site interaction effects u01k, for k ∈ {1, 2, ..., K}.

6.9 Testing the Average Treatment Effect

If the data are balanced, we can use the results of a nested analysis of variance

with random effects for the clusters and fixed effects for sites, treatments, and site-by-

treatment interaction. Similar to prior tests, the test statistic is an F statistic. The F test

follows a non-central F distribution, F(1, K(J − 2); λ). Recall that the noncentrality parameter is a ratio of the squared treatment effect to the variance of the treatment effect estimate. Below is the noncentrality parameter for the test:

λ = γ010² / [4(τπ + σ²/n)/(KJ)].   (16)

Recall that the larger the non-centrality parameter, the greater the power of the

test. By looking at the formula, we can see that KJ, the total number of clusters, has the

greatest impact on the power. Finally, increasing n does increase the power, but has the

smallest effect of the three sample sizes. Increasing n is most beneficial if there is a lot of

variability within clusters. In addition to K, J, and n, a larger effect size increases power. Note that unlike the case of the random effects model, τβ11, the variance of the treatment effect, does not appear in the denominator of the non-centrality parameter. However, if the variation of the treatment effects across sites is large, the average treatment effect may not be informative. Section 6.11 discusses the test of site-by-treatment effect variation under the fixed effects model. If the treatment effects vary across sites in a fixed effects model, the main effect of treatment must be interpreted with great caution.

In the fixed effects standardized model, the within-cluster variance, σ², and the between-cluster variance, τπ, sum to 1. The intra-cluster correlation, ρ, is defined as

ρ = τπ / (τπ + σ²).

Since τπ + σ² = 1, we can rewrite τπ = ρ and σ² = 1 − ρ. This notation is the same as the notation for the standardized model for the cluster randomized trial. Note that ρ is the between-cluster variance relative to the total variance within blocks. The non-centrality parameter, λ, can be rewritten in terms of the standardized model:

λ = KJ δ² / {4[ρ + (1 − ρ)/n]}   (17)

where ρ is the intra-cluster correlation,

ρ = τπ / (τπ + σ²),

or the variance between clusters relative to the between- and within-cluster variation within blocks; and δ is the standardized main effect of treatment,

δ = γ010 / √(τπ + σ²).
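Equation 17 can be turned into a power calculation directly. The sketch below is our own (the function name `fixed_effects_power` is an assumption, and the blocking parameter B is ignored, i.e. it corresponds to B = 0); it evaluates the noncentral F power for the fixed effects test with reference distribution F(1, K(J − 2); λ):

```python
from scipy.stats import f, ncf

def fixed_effects_power(n, J, K, rho, delta, alpha=0.05):
    """Approximate power for the main effect of treatment in a multi-site
    cluster randomized trial with FIXED site effects (equation 17).
    rho and delta are the standardized intra-cluster correlation and
    effect size; blocking is ignored in this sketch.
    """
    lam = K * J * delta ** 2 / (4.0 * (rho + (1.0 - rho) / n))  # equation 17
    df1, df2 = 1, K * (J - 2)                                   # F(1, K(J-2); lambda)
    fcrit = f.ppf(1.0 - alpha, df1, df2)
    return 1.0 - ncf.cdf(fcrit, df1, df2, lam)
```

Because λ depends on K and J only through their product KJ, the total number of clusters drives the power, matching the discussion of equation 16 above.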

6.10 Example

Let us compare the power to detect the treatment effect for a cluster randomized

trial to a multi-site cluster randomized trial with fixed effects. Consider a program

designed to teach students about the dangers of drugs. The researchers propose that

students who participate in the program will have a more positive attitude towards

staying away from drugs. The outcome is students’ attitudes towards drugs, which is

measured on a continuous scale. The researchers propose two designs and want to know

which design will give them the most power to detect the treatment effect.

Design 1 – Cluster Randomized Trial: The first design is a cluster randomized

trial. They plan to randomly assign schools to either treatment or control. Within each


school, they plan to test 100 students. Based on past research, the researchers estimate an

intra-class correlation of 0.15. They want to be able to detect a minimum effect size of

0.25. How many schools are necessary to achieve power = 0.80?

Design 2 – Multi-Site Cluster Randomized Trial: The second design is a multi-site

cluster randomized trial. Based on past studies, the researchers know that the school

setting is strongly related to attitude towards drugs. Research indicates that blocking on

school setting, suburban, urban, or rural, reduces the between-school variation by 33%.

Within each school, the researchers test 100 students. How many schools are necessary

for each of the three sites if the researchers want to detect a minimum treatment effect of

0.25 with power = 0.80? Assume ρ = 0.15.

Design 1: Plugging the information into the Cluster Randomized Trial option, we

get Figure 6.

Figure 6. Power vs. number of clusters. [Plot legend: α = 0.050, n = 100; δ = 0.25, ρ = 0.15.]

In order to achieve power = 0.80, the researchers need to randomize a total of

approximately 82 schools.

Design 2: Figure 7 displays the power curve for the fixed effects multi-site cluster

randomized trial.

Figure 7. Power vs. number of clusters per site (Power for Treatment Effect). [Plot legend: α = 0.050, n = 100, K = 3; δ = 0.25, ρ = 0.15, σ²δ = 0.000, B = 0.33. Curves with σ²δ set to 0.0 are curves for the fixed effects model.]

Clicking along the power curve we can see that approximately 19 schools per site are

required to achieve power = 0.80. This is a total of 19*3, or 57 schools. By blocking, we

reduce the number of schools required to achieve power = 0.80 by 25 (82-57).

While the fixed effects model affords extra power for testing the main effect of

treatment, the interpretation of such a main effect requires great caution when treatment

effects vary across sites. We now turn to the question of testing site-by-treatment

variation within the fixed effects model.

6.11 Testing Site-by-Treatment Variation in the Context of a Fixed Effects Model.

Operationally, the test of the site-by-treatment variation in the case of the fixed

effects model is identical to that in the case of the random effects model (see Section 6.6

“Testing the Variance of the Treatment Effect”). The null hypothesis, however, differs. Recall that in the case of the random effects model we test

H0: τβ11 = 0

or, for the standardized random effects model, we test

H0: σ²δ = 0.

However, in the fixed effects model, the site-specific treatment effects are fixed constants rather than random variables. Thus we have, in the non-standardized model,

H0: K⁻¹ Σk u01k² = 0.

As in the random effects case, we test this hypothesis using

F[K − 1, K(J − 2)] = MS(treatment by sites) / MS(within cells).

When the F test indicates rejection of H0, one emphasizes the estimation of site-specific treatment effects (also known as “simple main effects” – see Kirk (1982), p. 365) or post hoc procedures designed to identify subsets of sites for which the treatment effect is homogeneous (see Kirk (1982), p. 317).
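Operationally, the F ratio above can be computed from raw data. The following sketch is our own helper (the name `site_by_treatment_F` and the array layout are assumptions, not part of Optimal Design); it takes a balanced design stored as an array with K sites, 2 treatment arms, J/2 clusters per arm, and n persons per cluster, and forms MS(treatment by sites) / MS(clusters within site-by-treatment cells):

```python
import numpy as np
from scipy.stats import f

def site_by_treatment_F(y):
    """F test for site-by-treatment interaction in a balanced multi-site
    cluster randomized trial with fixed site effects.

    y has shape (K, 2, J2, n): K sites, 2 arms (control/treatment),
    J2 = J/2 clusters per arm per site, n persons per cluster.
    Returns (F, p) with df = (K - 1, K*(J - 2)).
    """
    K, _, J2, n = y.shape
    J = 2 * J2
    cluster_means = y.mean(axis=3)            # (K, 2, J2)
    cell_means = cluster_means.mean(axis=2)   # (K, 2) site-by-arm cell means
    site_means = cell_means.mean(axis=1)      # (K,)
    arm_means = cell_means.mean(axis=0)       # (2,)
    grand = cell_means.mean()
    # Interaction sum of squares from cell means (each cell holds J2*n people)
    inter = cell_means - site_means[:, None] - arm_means[None, :] + grand
    ms_txs = n * J2 * (inter ** 2).sum() / (K - 1)
    # Clusters-within-cells sum of squares: the error term for this test
    ss_within = n * ((cluster_means - cell_means[:, :, None]) ** 2).sum()
    df2 = K * (J - 2)
    F = ms_txs / (ss_within / df2)
    return F, f.sf(F, K - 1, df2)
```

Note that the error term is the variation among clusters within site-by-treatment cells, not the person-level residual, which is what gives the K(J − 2) denominator degrees of freedom.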


7. Using the Optimal Design Software for Multi-site Cluster Randomized Trials

This chapter focuses on how to use the OD software to design a multi-site cluster

randomized trial. Section 7.1 provides general information for how to use the multi-site

cluster randomized trial software. Sections 7.2 and 7.3 provide examples and details

about the options within the multi-site cluster randomized trials option for the case of

random site effects. Section 7.4 provides an example in the case of fixed site effects.

7.1 General Information

The multi-site cluster randomized trials option allows the researcher to calculate

the power for the average treatment effect, the effect size variability, and the cluster

effect. Below is the menu:

Multisite Cluster Randomized Trials

Power for Treatment Effect

Power vs. cluster size (n)

Power vs. number of sites (K)

Power vs. clusters per site (J)

Power vs. intra-class correlation (rho)

Power vs. effect size (delta)

Power vs. effect size variability (sigma)

Power vs. proportion of explained variation by level 2 covariate (R2)

Power for Effect Size Variability

Power vs. cluster size (n)

Power vs. number of sites (K)

Power vs. clusters per site (J)

Power vs. intra-class correlation (rho)

Power vs. effect size variability (sigma)


7.2 Power for the Average Treatment Effect (Random Effects Model)

The first option is power for the treatment effect. We know that the power for the treatment effect in a random effects model is a function of the cluster size, n, the number of clusters per site, J, the number of sites, K, the intra-class correlation, ρ, the effect size variability, σ²δ, the effect size, δ, and the proportion of variation explained by a level-2 covariate, denoted R²L2 in the program. Thus we can calculate the power as a function of any one of these components while holding the others constant. Section 7.2.1 provides an example that will be used to illustrate how to use the software to calculate the power for the treatment effect in a random effects model.

7.2.1 Example

The example below was introduced in Chapter 2 and is modified in this chapter.

Suppose a team of researchers develops a new literacy program. The founders of the new

program propose that students who participate in the program will have increased reading

achievement. They propose a three-level design with students nested within classrooms

within schools. In other words, they want to block on school. By blocking on school, the

researchers expect to explain 40% of the variation in the outcome variable. They plan to

test students who are in classrooms that participate in the new program (experimental

group) and students who are in classrooms that participate in the regular program (control

group) in each of the schools using a reading test to determine if students using the new

program score higher. However, they are unsure how to proceed with respect to the

number of students they should test in each classroom, the number of classrooms in each

school, and the number of schools in order to conduct a trial with power = 0.80. Five

scenarios the researcher might encounter are presented below. Assume 05.0=α for each

case.

7.2.2 Scenario 1

Based on past studies, the researchers estimate ρ = 0.15 and σ²δ = 0.01, and want to be able to detect a minimum standardized effect size of 0.20. Assuming 15 schools are willing to participate as well as 10 classrooms within each school, how many students within each classroom are necessary to achieve power = 0.80? What if the researchers include a cluster-level covariate that explains 49% of the variation in the cluster-level mean? How many children per classroom are necessary to achieve power = 0.80?

In Scenario 1, the cluster size is unknown, thus we select the power vs. cluster size (n) option. Figure 1 displays the screen.

Figure 1. Multi-Site Cluster Randomized Trial Screen.

Let’s take a closer look at the function of each of the buttons on the toolbar.

α - specifies the significance level, or chance of a Type I error. By default, α is set at

0.05, which is a common level for most designs.

K – specifies the number of sites. By default, K is set at 12.

J – specifies the number of clusters within each site. By default, J is set at 10.

δ - specifies the minimum effect size of interest. Note this is the expected effect size before blocking. By default, the minimum effect size is set at 0.20.

σ²δ - specifies the effect size variability. By default, it is set at 0.01.

ρ - specifies the intra-class correlation before blocking. Thus it is the between-cluster plus between-block variance divided by the within-cluster plus between-cluster plus between-block variance. By default, it is set at 0.01 and 0.10. Trajectories for both intra-class correlations are plotted so they can be compared.

B - specifies the proportion of variance explained by the blocking variable. By default, it is set to 0.

R²L2 - specifies the correlation between the cluster-level covariate and the cluster mean outcome. By default, it is set to 0.

The remaining options on the toolbar are the same as those in the Cluster Randomized

Trial option, which are explained in detail in Chapter 3.

Now let’s use the software to explore the question in Scenario 1. To answer the question, click on Power vs. cluster size (n). Then move along the toolbar and specify K = 15, J = 10, δ = 0.20, σ²δ = 0.01, ρ = 0.15, B = 0.40, and R²L2 = 0.00 and 0.49. This will allow us to see the plot with and without the cluster-level covariate. Figure 2 displays the screen that appears.

Figure 2. MSCRT - Power vs. cluster size.

Clicking on the solid trajectory reveals that with 17 students per classroom, the power is

0.80. However, if we include the cluster-level covariate, we only need 9 students per

classroom to achieve power = 0.80.

Note that the power does not approach 1.0 for either of these trajectories: increasing the sample size per cluster alone cannot drive the power to 1.0. Additional effect

sizes, effect size variability, and intra-class correlations could also be specified to look at

a variety of trajectories on one screen.

7.2.3 Scenario 2

Based on past studies, the researchers estimate ρ = 0.15 and σ²δ = 0.01, and want to be able to detect a minimum standardized effect size of 0.20. Assuming 10 classrooms within each school are willing to participate and 25 students within each classroom, how many schools are necessary to achieve power = 0.80? What if the researchers include a cluster-level covariate that explains 49% of the variation in the cluster-level mean? How many schools are necessary to achieve power = 0.80?

In Scenario 2, the number of sites is unknown, thus we select the power vs. number of sites (K) option. This will allow the number of sites to vary along the x-axis. As a result, K will be replaced on the toolbar by the cluster size (n) icon. The remaining options function as previously described. Now let’s use the software to revisit Scenario 2.

To answer the question, click on Power vs. number of sites (K). Then move along the toolbar and specify n = 25, J = 10, δ = 0.20, σ²δ = 0.01, ρ = 0.15, B = 0.40, and R²L2 = 0.00 and 0.49. Figure 3 displays the screen that appears.

Figure 3. MSCRT - Power vs. number of sites.


Clicking along the solid trajectory reveals that 14 sites, or schools, are necessary to

achieve power = 0.80. When the cluster level covariate is included in the design, clicking

along the dotted line trajectory reveals that only 10 schools are necessary to achieve

power = 0.80. Note that unlike the cluster size, as the number of sites increases, the

power tends towards 1. In other words, the number of sites is very important for

increasing the power. This corresponds to the information presented in Chapter 5. Note

that without the covariate, 14 x 10 = 140 classrooms are necessary and with the covariate

10 x 10 = 100 classrooms are necessary to achieve power = 0.80.

7.2.4 Scenario 3

Based on past studies, the researchers estimate ρ = 0.15 and σ²δ = 0.01, and want to be able to detect a minimum standardized effect size of 0.20. Assuming 15 schools are willing to participate as well as 25 students per classroom, how many classrooms are necessary to achieve power = 0.80? What if the researchers include a cluster-level covariate that explains 49% of the variation in the cluster-level mean? How many classrooms are necessary to achieve power = 0.80?

In Scenario 3, the number of clusters, or classrooms, is unknown, thus we select the power vs. number of clusters (J) option. Again, the only change in the toolbar is that J no longer appears, since it is allowed to vary along the x-axis. Now let’s explore Scenario 3 using the software.

To answer the question, select Power vs. number of clusters (J) from the menu. Then move along the toolbar and specify n = 25, K = 15, δ = 0.20, σ²δ = 0.01, ρ = 0.15, B = 0.40, and R²L2 = 0.00 and 0.49. Figure 4 displays the new screen with the x-axis adjusted to min = 2.0 and max = 20.0.

Figure 4. MSCRT - Power vs. number of clusters per site.

Clicking along the solid trajectory reveals that 9 classrooms per school will achieve the

desired power of 0.80. When the cluster-level covariate is included, only 6 classrooms

per school are necessary. Thus in the case of no covariate, a total of K x J, or 15 x 9 =

135 classrooms are necessary to achieve the desired power. A total of 15 x 6 = 90

classrooms are necessary when the cluster level covariate is included.

7.2.5 Scenario 4

Based on past studies, the researchers estimate σ²δ = 0.01 and want to be able to detect a minimum standardized effect size of 0.20. Assuming 15 schools are willing to participate, 10 classrooms within each school, and 25 students per classroom, what value of the intra-class correlation results in power = 0.80? What if the researchers include a cluster-level covariate that explains 49% of the variation in the cluster-level mean? What value of the intra-class correlation results in power = 0.80?

In Scenario 4, the value of the intra-class correlation is unknown, thus we select power vs. intra-class correlation (ρ). Again, the only change in the toolbar is that ρ no longer appears, since it is allowed to vary along the x-axis. Now let’s explore Scenario 4 using the software.

To answer the question, click power vs. intra-class correlation (ρ). Then move along the toolbar and specify n = 25, K = 15, J = 10, δ = 0.20, σ²δ = 0.01, B = 0.40, and R²L2 = 0.00 and 0.49. Figure 5 displays the new screen.

Figure 5. MSCRT - Power vs. intra-class correlation.

Clicking along the solid trajectory reveals that ρ of approximately 0.15 results in power

= 0.80. Clicking along the dotted trajectory reveals that ρ of approximately 0.25

achieves power = 0.80. Note that for any single trajectory, as the intra-class correlation

increases, the power of the test decreases.


7.2.6 Scenario 5

Based on past studies, the researchers estimate ρ = 0.15 and σδ² = 0.01. Assuming 15 schools are willing to participate, 10 classrooms within each school, and 25 students per classroom, what is the minimum detectable effect size that results in power = 0.80? What if the researchers include a cluster-level covariate that explains 49% of the variation in the cluster-level mean? What is the minimum detectable effect size that results in power = 0.80?

In Scenario 5, the effect size is unknown, so we select the power vs. effect size option. Again, the only change in the toolbar is that δ no longer appears since it is allowed to vary along the x-axis. Now let's explore Scenario 5 using the software.

To answer the question, click on power vs. effect size. Then move along the toolbar and specify n=25, K=15, J=10, σδ² = 0.01, ρ = 0.15, B = 0.40, and R²L2 = 0.00 and 0.49. Figure 6 displays the new screen.

Figure 6. MSCRT - Power vs. effect size.

Clicking along the solid trajectory reveals that an effect size of approximately 0.19 results in power = 0.80. Clicking along the dotted trajectory reveals that by including a cluster-level covariate that explains 49% of the variation in the cluster-level outcome, we can detect an effect size of approximately 0.16.

7.2.7 Scenario 6

Based on past studies, the researchers estimate ρ = 0.15 and want to be able to detect a minimum standardized effect size of 0.20. Assuming 15 schools are willing to participate, 10 classrooms within each school, and 25 students per classroom, what is the maximum effect size variability at which the test still has power = 0.80? What if the researchers include a cluster-level covariate that explains 49% of the variation in the cluster-level mean? What is the maximum effect size variability at which the test still has power = 0.80?

In Scenario 6, the effect size variability is unknown, so we select the power vs. effect size variability (σδ²) option. Again, the only change in the toolbar is that σδ² no longer appears since it is allowed to vary along the x-axis. Now let's explore Scenario 6 using the software.

To answer the question, click power vs. effect size variability (σδ²). Then move along the toolbar and specify n=25, K=15, J=10, δ = 0.20, ρ = 0.15, B = 0.40, and R²L2 = 0.00 and 0.49. Figure 7 displays the new screen.

Figure 7. MSCRT - Power vs. effect size variability.


Clicking on the solid trajectory, we can achieve power = 0.80 with an effect size variability as large as 0.016. However, if we include the cluster-level covariate, we can achieve power = 0.80 with an effect size variability as large as 0.038.

7.2.8 Scenario 7

Based on past studies, the researchers estimate ρ = 0.15 and σδ² = 0.01 and want to be able to detect a minimum standardized effect size of 0.20. Assume 15 schools are willing to participate, 10 classrooms within each school, and 25 students per classroom. What proportion of variation explained by the cluster-level covariate results in power = 0.80?

In Scenario 7, a cluster-level covariate is available, but the proportion of variation explained by the covariate is unknown. Thus, we select the power vs. proportion of explained variation by level 2 covariate option. Again, the only change in the toolbar is that R²L2 no longer appears since it is allowed to vary along the x-axis. Now let's explore Scenario 7 using the software.

To answer the question, click power vs. cluster level covariate correlation. Then move along the toolbar and specify n=25, J=10, K=15, δ = 0.20, σδ² = 0.01, ρ = 0.15, and B = 0.40. Figure 8 displays the new screen.

Figure 8. MSCRT - Power vs. cluster level covariate correlation.

We can see that even without the cluster-level covariate, the design has power = 0.80. Inclusion of the cluster-level covariate increases the power to greater than 0.80.

7.3 Power for effect size variability

Thus far we have focused on power calculations for the treatment effect in a random effects model. Researchers may also be interested in the power for effect size variability. Recall that if the effect size variability is large, the average treatment effect may be meaningless, and it is important to investigate moderating effects to explain the variability in effect sizes. As a result, it is important to be able to detect the effect size variability with adequate power. Similar to the power for detecting the treatment effect, the power to detect effect size variability is a function of the cluster size, n, the number of clusters, J, the number of sites, K, and the intra-class correlation, ρ. The main difference is that the effect size, δ, does not affect the power to detect treatment effect variability. Section 7.3.1 provides an example that will be used to illustrate how to use the software to calculate the power to detect effect size variability.

7.3.1 Example

The example is a continuation of the example in Section 7.2.1. Recall that a team of researchers has developed a new literacy program. The founders of the new program propose that students who participate in the program will have increased reading achievement. They propose a three level design with students nested within classrooms nested within schools. They expect blocking by school to explain 40% of the variability in the outcome. They plan to test students who are in classrooms that participate in the regular program (control group) and students who are in classrooms that participate in the new program (experimental group) in each of the participating schools using a reading test. In addition to determining the power for the treatment effect, they also want to know the power to detect the variability in the effect sizes across sites. Five scenarios the researchers might encounter are presented below. Assume α = 0.05 for each case.

7.3.2 Scenario 1

Based on past studies, the researchers estimate ρ = 0.15. Assuming 15 schools are willing to participate, as well as 10 classrooms per school, how many students within each classroom are necessary to detect an effect size variability of 0.10 with power = 0.80?

In Scenario 1, the cluster size is unknown, so we select the power vs. cluster size (n) option. The toolbar at the top of the screen is the same as the toolbar described in Section 6.2.2, except there is no δ or effect size button. Thus detailed descriptions of the buttons are not provided in this section.

To answer the question, click on Power vs. cluster size (n). Then move along the toolbar and specify K=15, J=10, σδ² = 0.10, ρ = 0.15, and B = 0.40. Figure 9 displays the new screen.

Figure 9. Power vs. cluster size.


Clicking along the trajectory reveals that 14 students per classroom are required to

achieve power = 0.80.

7.3.3 Scenario 2

Based on past studies, the researchers estimate ρ = 0.15. Assuming 10 classrooms within each school are willing to participate, as well as 25 students per classroom, how many schools are necessary to detect an effect size variability of 0.10 with power = 0.80?

In Scenario 2, the number of sites is unknown thus we select the power vs.

number of sites (K) option. This allows the number of sites to vary along the x-axis.

To answer the question, click on Power vs. number of sites (K). Then move along the toolbar and specify n=25, J=10, σδ² = 0.10, ρ = 0.15, and B = 0.40. Figure 10 displays the result.

Figure 10. Power vs. number of sites.


Clicking along the trajectory reveals that 12 schools are necessary to detect an effect size

variability of 0.10 with power = 0.80. Note that as the number of sites increases, the

power increases.

7.3.4 Scenario 3

Based on past studies, the researchers estimate ρ = 0.15. Assuming 15 schools are willing to participate, as well as 25 students within each classroom, how many classrooms are necessary to detect an effect size variability of 0.10 with power = 0.80?

In Scenario 3, the number of clusters is unknown, so we select the power vs. number of clusters (J) option. This allows the number of clusters to vary along the x-axis.

To answer the question, click on Power vs. number of clusters (J). Then move along the toolbar and specify n=25, K=15, σδ² = 0.10, ρ = 0.15, and B = 0.40. Figure 11 displays the results.

Figure 11. Power vs. number of clusters per site.


Clicking along the trajectory, we can see that 8 classrooms per school are necessary to achieve power = 0.80 to detect an effect size variability of 0.10. Note that the power to detect the effect size variability increases rapidly toward 1 as the number of clusters per site increases.

7.3.5 Scenario 4

Assume the researchers have secured 15 schools as well as 10 classrooms per

school and 25 students per classroom. What value of the intra-class correlation is

necessary to detect an effect size variability of 0.10 with power = 0.80?

In Scenario 4, the value of the intra-class correlation is unknown, so we select the power vs. intra-class correlation (ρ) option. This allows ρ to vary along the x-axis.

To answer the question, click on Power vs. intra-class correlation (ρ). Then move along the toolbar and specify n=25, J=10, K=15, and σδ² = 0.10. Figure 13 displays the results.

Figure 13. Power vs. intra-class correlation.


Clicking along the trajectory reveals that ρ = 0.11 results in power = 0.80 for detecting an effect size variability of 0.10. Note that as the intra-class correlation increases, the power to detect the treatment effect variability decreases.

7.3.6 Scenario 5

Based on past studies, the researchers estimate ρ = 0.15. Assuming 15 schools are willing to participate, with 10 classrooms per school and 25 students per classroom, what is the minimum effect size variability that can be detected with power = 0.80?

In Scenario 5, the effect size variability is unknown, so we select the power vs. effect size variability (σδ²) option. This allows the effect size variability to vary along the x-axis.

To answer the question, click on Power vs. effect size variability (σδ²). Then move along the toolbar and specify n=25, J=10, K=15, ρ = 0.15, and B = 0.40. Figure 14 displays the results.

Figure 14. Power vs. effect size variability.


Clicking along the trajectory reveals that a minimum effect size variability of 0.087 can be detected with power = 0.80. Note that as the effect size variability increases, the power tends toward 1. Intuitively this makes sense: as there is more variation among the effect sizes, the variability becomes easier to detect. Note that this is the opposite of what occurs in the test for the main effect of treatment. In that test, as the effect size variability increases, the power to detect the treatment effect decreases. This makes sense because it becomes more difficult to detect the average effect when there is substantial variation around it.

In general, smaller sample sizes are required to achieve adequate power to detect the variance in the treatment effect than to detect the main effect of treatment. Thus the primary focus of the researcher should be to design a study that has good power to detect the main effect of treatment. The researcher can then investigate the power to detect the treatment effect variability, since that test has less stringent requirements.

7.4 Power for the Average Treatment Effect (Fixed Effects Model)

We can also calculate the power for the treatment effect in a fixed effects model. In this case, the power for the treatment effect is a function of the cluster size, n, the number of clusters, J, the number of sites, K, the intra-class correlation, ρ, the effect size, δ, the percentage of variance explained by the blocking variable, B, and the cluster-level covariate correlation, denoted R²L2 in the program. In the fixed effects case, the effect size variability, σδ², is set equal to 0. Using the OD program to calculate the power for the average treatment effect in a fixed effects model is the same as in the random effects model discussed in Section 6.2, except that we specify σδ² = 0. Only one example for the fixed effects model is provided below, since the directions in Section 6.2 can easily be modified for a fixed effects model by specifying σδ² = 0.

Suppose researchers want to test the effect of a new math program designed for students in grades 1-12. Based on past studies, the researchers know that school type (elementary, middle, or high school) is related to the outcome, math achievement. Research indicates that blocking on school type reduces the between-school variation by 70%. Research also indicates that the between-cluster (school) variation prior to blocking is 0.15. Within each school, the researchers test 100 students. What is the minimum effect size the researchers can detect with power = 0.80 and 10 schools per site?

To answer the question, click on Power vs. effect size (delta). Then move along the toolbar and specify n=100, J=10, K=3, σδ² = 0.00, ρ = 0.15, and B = 0.70. Figure 15 displays the result.


Figure 15. MSCRT - Power vs. effect size.

A note appears on the screen indicating that this is a fixed effects model because the

effect size variability is set to 0. Clicking along the trajectory reveals that an effect size of

approximately 0.25 can be detected for power = 0.80.


8. Three Level Models with Randomization at Level Three

In Chapter 7, we discussed multi-site cluster randomized trials, or designs that

include three levels, where level three is a site, or block. For example, in a multi-site

cluster randomized trial, we might have students nested within classrooms within schools

where schools function as blocks. In this case, the randomization occurs at level 2, or the

classroom-level. In this chapter, we again consider three levels of data. However, in the

three level trial discussed in this chapter, the randomization occurs at level 3, or at the

school level. Chapter 8 provides a conceptual background for the three level model with

treatment at level 3, including the use of covariates at level 3. Chapter 9 describes how to

use the Optimal Design software to design a three level study with adequate power to

detect the treatment effect.

8.1 General Description of the Three Level Model

A three level trial with randomization at level 3 is a commonly used design. For

example, imagine an evaluation for a new elementary math program. Schools are

randomly assigned to either the new program or their regular program. Within each

school, all the classrooms adopt the new program. Thus, we have students within

classrooms within schools, where schools are assigned at random to treatment or control.

In order to calculate the power for this design, we need to account for the variability at

the child, classroom, and school levels. A common mistake is to simplify this trial to a cluster randomized trial and ignore the classroom level. However, variability among teachers might mean that students in classroom A react differently to the program than students in classroom B. In addition, students within a classroom tend to be more similar to each other than to students in other classrooms. We need to account for the classroom or teacher level variability so that we do not overestimate the precision of the estimate and the power of the test.

This chapter focuses on how to calculate the power for a three level design under

two different conditions. First, we discuss power considerations when there is no level 3

covariate. Second, we look at the power for a design with a level 3 covariate.

8.2 The Three Level Model With No Covariates

Suppose a team of researchers is interested in the effect of a new comprehensive

school reform (CSR) on math outcomes. The CSR is implemented at the school level.

Schools are randomly assigned to either the CSR or their regular teaching methods.


Within each school, students are nested within classrooms. To account for the nested

structure of the data, the researchers use a three level model. Suppose there are 40

schools participating in the experiment, 20 treatment schools and 20 control schools.

Within each school there are 8 classes and 25 students within each class. The researchers

want to detect an effect size of 0.25. Assume that 20% of the variation lies between

classrooms and 10% of the variation lies between schools. What is the power of the test

to detect the treatment effect based on the above constraints?

Entering the information into the OD software reveals that the power to detect the

treatment effect is 0.58. The trajectory for this case is displayed in Figure 1.

Figure 1. Three level model – Power vs. number of sites (α = 0.050, n = 25, J = 8, τπ = 0.200, τβ = 0.100, δ = 0.25).

Note that as the number of schools increases, the power to detect an effect also increases.

Let’s look at the model to help us understand the components of power in a three level

design.

8.2.1 The Model

We can represent the data from this design as persons nested within clusters nested within sites. The level-1, or person-level, model is:

Yijk = π0jk + eijk,  eijk ~ N(0, σ²)  (1)

where i = 1, ..., n indexes persons per cluster;
j = 1, ..., J indexes clusters per site;
k = 1, ..., K indexes sites;
π0jk is the mean for cluster j in site k;
eijk is the error associated with each person;
σ² is the within-cluster variance.

The level-2, or cluster-level, model is:

π0jk = β00k + r0jk,  r0jk ~ N(0, τπ)  (2)

where β00k is the mean for site k;
r0jk is the random effect associated with each cluster;
τπ is the variance between clusters within sites.

The level-3, or site-level, model is:

β00k = γ000 + γ001·Wk + u00k,  u00k ~ N(0, τβ00)  (3)

where γ000 is the estimated grand mean;
γ001 is the treatment effect ("main effect of treatment");
Wk is 0.5 for treatment and –0.5 for control;
u00k is the random effect associated with each site mean;
τβ00 is the residual variance between site means.

Note that unlike the multi-site cluster randomized trial, the randomization in this design occurs at level 3.

8.2.2 Testing the Main Effect of Treatment

In the model above, the treatment effect is estimated at level 3 and is denoted γ001. Given a balanced design, it is estimated by:

γ̂001 = ȲE − ȲC  (4)

where ȲE is the mean for the experimental group and ȲC is the mean for the control group.

Because of the nested structure of the data, we sum over clusters and sites in order to estimate the treatment effect. The variance of the estimated treatment effect combines the variance at all three levels: the variance between site means, τβ00, the within-site or between-cluster variance, τπ, and the within-cluster or between-person variance, σ². Note that unlike a multi-site cluster randomized trial, there is no estimated variance component for the between-site variance in the treatment effect, τβ11. This difference is easy to see by comparing the models for the two designs: there is no τβ11 in the model for the three level design. Conceptually, the difference exists because in the multi-site cluster randomized trial we have mini-experiments at each site, which allow us to estimate K treatment effects and to calculate the between-site variability of the treatment effect. However, in the three level design, the treatment is applied at level 3, so we are only able to estimate one treatment effect. The variance of the treatment effect is estimated by:

Var(γ̂001) = 4[τβ00 + (τπ + σ²/n)/J] / K  (5)

If the data are balanced, we can use the results of a nested analysis of variance with random effects for the clusters and sites and a fixed effect for the treatment. Similar to prior tests, the test statistic is an F statistic. The F test follows a non-central F distribution, F(1, K−2; λ). Recall that the noncentrality parameter, λ, is the ratio of the squared treatment effect to the variance of the treatment effect estimate. The noncentrality parameter for the test is:

λ = γ001² / Var(γ̂001) = K·γ001² / {4[τβ00 + (τπ + σ²/n)/J]}  (6)

Recall that increasing the noncentrality parameter increases the power to detect the treatment effect. Let's examine how the researcher can increase the noncentrality parameter to increase the power of the test. Because this model assumes no covariates, we cannot reduce any of the variance components, so τβ00, τπ, and σ² are not under the control of the researcher. The only remaining pieces of the noncentrality parameter are the sample size and the size of the treatment effect. The size of the treatment effect is often based on theory, past studies, or a pilot study, which means the researcher cannot inflate the size of the treatment effect to increase power without undermining the theoretical or practical conclusions of the study. Thus increasing the sample size is the only option for increasing the power. From equation 6, we can see that increasing the number of sites, K, is the most effective strategy to increase the power, followed by the number of clusters, J, and finally the number of persons per cluster, n.
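A quick numeric check makes that ordering concrete. The sketch below (plain Python; the variance components are the illustrative values from the example in Section 8.2, namely τβ00 = 0.10, τπ = 0.20, σ² = 0.70, with γ001 = 0.25) evaluates equation 6 when each sample-size dimension is doubled:

```python
def ncp(n, J, K, gamma=0.25, tau_beta=0.10, tau_pi=0.20, sigma2=0.70):
    """Noncentrality parameter lambda from equation 6."""
    return K * gamma**2 / (4 * (tau_beta + (tau_pi + sigma2 / n) / J))

print(round(ncp(25, 8, 40), 2))   # baseline design -> 4.86
print(round(ncp(25, 8, 80), 2))   # double the number of sites K -> 9.73
print(round(ncp(25, 16, 40), 2))  # double the clusters per site J -> 5.47
print(round(ncp(50, 8, 40), 2))   # double the cluster size n -> 4.93
```

Doubling K doubles λ outright, while doubling J or n leaves the between-site term τβ00 untouched, which is why adding sites buys the most power.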

8.2.3 The Standardized Model with No Covariates

Thus far we have focused on the unstandardized model. However, as previously

mentioned, researchers typically discuss standardized effect sizes. We continue to utilize

Cohen’s definition for standardized effect sizes, and adopt 0.20, 0.50, and 0.80 as small,

medium, and large effect sizes.

In the standardized model, we set the sum of the within-cluster variance, σ², the between-cluster variance, τπ, and the between-site variance of the site means, τβ00, equal to 1. Since we use three components of variance to standardize the model, we have two intra-class correlations, ρlevel2 and ρlevel3. The first intra-class correlation, ρlevel2, is the between-cluster variance relative to the total variance:

ρlevel2 = τπ / (τβ00 + τπ + σ²).

The second intra-class correlation, ρlevel3, is the between-site variance relative to the total variance:

ρlevel3 = τβ00 / (τβ00 + τπ + σ²).

In standardized notation, the noncentrality parameter, λ, can be rewritten as:

λ = K·δ² / {4[ρlevel3 + (ρlevel2 + (1 − ρlevel2 − ρlevel3)/n)/J]}  (7)

where δ is the standardized main effect of treatment, δ = γ001 / √(τβ00 + τπ + σ²).

Because ρlevel2 and ρlevel3 are often unknown and can be difficult to estimate, the Optimal Design program instead asks the user for estimates of the proportion of variance at level 1, level 2, and level 3, where the three proportions are constrained to sum to 1. These values should be easier for the user to estimate.
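The pieces above are enough to compute power outside the program. The sketch below (Python with scipy assumed; `power_treatment_l3` is a name invented here for illustration, not an OD routine) plugs equation 7 into the non-central F(1, K−2; λ) test from Section 8.2.2 and should reproduce, to within rounding, the power of roughly 0.58 found for the example in Section 8.2:

```python
from scipy import stats

def power_treatment_l3(n, J, K, delta, rho2, rho3, alpha=0.05):
    """Sketch: power for the main effect of treatment in a three-level
    trial with treatment at level 3, via equation 7 and the
    non-central F(1, K-2; lambda) test."""
    lam = K * delta**2 / (
        4 * (rho3 + (rho2 + (1 - rho2 - rho3) / n) / J))
    f_crit = stats.f.ppf(1 - alpha, 1, K - 2)
    return 1 - stats.ncf.cdf(f_crit, 1, K - 2, lam)

# Example from Section 8.2: n=25, J=8, K=40, delta=0.25,
# 20% of variance between classrooms, 10% between schools
print(power_treatment_l3(25, 8, 40, 0.25, 0.20, 0.10))
```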


8.3 The Three Level Model with a Site Level Covariate

Often a site-level covariate is available to the researcher. The researchers can use this information to reduce the level-3 variability, or the between-site variance. As noted in Section 8.2.2, reducing the site-level variability can help increase the power of the test. Because a site-level covariate is measured at level 3, it only affects the variability at level 3, τβ00. In other words, including a site-level covariate will not affect the between-cluster variability, τπ, or the within-cluster variability, σ². We use S to denote a site-level covariate in the model. The proportion of variance explained by the site-level covariate is defined as ρ²βs.

Let's modify the example provided in Section 8.2 to include a site-level covariate. Recall that a team of researchers is investigating the effects of a CSR on math outcomes. Suppose that school means on last year's state math test are available to the researchers, and that this site-level covariate reduces the level-3 variance by 64%. Recall that 20% of the variation lies between classrooms, 10% of the variation in the outcome lies between schools, and 70% of the variation lies within classrooms. The site-level covariate reduces the between-school variance to (1 − 0.64)(0.10) = 0.036, meaning that only 3.6% of the variation remains as unexplained between-site variance. Recall that there are 40 schools participating in the experiment, 20 treatment schools and 20 control schools. Within each school there are 8 classes and 25 students within each class. Under the original conditions, with no site-level covariate, the power to detect the main effect of treatment was 0.58. Using the OD software, we can see that including the site-level covariate results in power equal to 0.86. Figure 2 displays the results.

Figure 2. Three level model – Power vs. number of sites (α = 0.050, n = 25, J = 8, τπ = 0.200, τβ = 0.100, δ = 0.25, R²L3 = 0.64).

Let's look at the model for the design with a site-level covariate to see how the covariate affects the power of the test.

8.3.1 The Model

Levels 1 and 2 of the model with a site-level covariate are identical to the level-1 and level-2 equations (equations 1 and 2) for the case with no covariate. This is because inclusion of a site-level covariate does not affect the variability at the lower levels of the model. The new level-3, or site-level, model is:

β00k = γ000 + γ001·Wk + γ002·Sk + u00k,  u00k ~ N(0, τβ00|s)  (8)

Note: τβ00|s = (1 − ρ²βs)·τβ00

where γ000 is the estimated grand mean;
γ001 is the treatment effect ("main effect of treatment");
γ002 is the regression coefficient for the level-3 covariate;
Wk is 0.5 for treatment and –0.5 for control;
Sk is the level-3 covariate;
u00k is the random effect associated with each site mean;
τβ00|s is the residual variance between site means conditional on the site-level covariate.

Note that the level 3 variance is adjusted for the covariate. The smaller variance will

increase the precision of the estimate thus increasing the power of the test.

Given a balanced design, the main effect of treatment is estimated as the difference between the treatment and control group means adjusted for the site-level covariate:

γ̂001 = ȲE − ȲC − γ̂002·(S̄E − S̄C).  (9)

The variance of the treatment effect is:

Var(γ̂001 | S) = 4[τβ00|s + (τπ + σ²/n)/J] / K.  (10)

Note that only the between-site variance, τβ00, is adjusted for inclusion of the covariate, since the covariate is at the site level.

Similar to the case with no covariate, to test the main effect of treatment we use an F statistic, which follows a non-central F distribution, F(1, K−3; λs), where:

λs = K·γ001² / {4[τβ00|s + (τπ + σ²/n)/J]}.  (11)

The noncentrality parameter for the test of the main effect of treatment looks similar to equation 6, the case with no covariate, except that the level-3 variance and the estimate of the treatment effect are adjusted for the site-level covariate. Note that reducing the variability at level 3 gives the researcher another tool for increasing the noncentrality parameter and thus the power. In cases where the between-site variance accounts for a high proportion of the total variance, finding a site-level covariate that is highly correlated with the site-level outcome can be very beneficial. It may also reduce the number of sites necessary to achieve a specified power, which can reduce the cost of the study.


8.3.2 The Standardized Model with a Site-Level Covariate

Following the same logic as the three level model with no covariates, it is important to standardize the model. The noncentrality parameter expressed in standardized notation is:

λs = K·δ*² / {4[ρ*level3 + (ρ*level2 + (1 − ρ*level2 − ρ*level3)/n)/J]}

where

ρ*level2 = τπ / (τβ00|s + τπ + σ²) is the level-2 intra-class correlation, or the proportion of variance among clusters relative to the total variation conditional on the level-3 covariate;

ρ*level3 = τβ00|s / (τβ00|s + τπ + σ²) is the level-3 intra-class correlation, or the proportion of variance among sites relative to the total variation conditional on the level-3 covariate;

δ* = γ001 / √(τβ00|s + τπ + σ²) is the standardized main effect of treatment conditional on the level-3 covariate.

Because the conditional standardized quantities, ρ*level2, ρ*level3, and δ*, are frequently unknown, the program asks the user to enter the unconditional parameters. The program then calculates the conditional standardized values from the proportion of variance reduction at level 3, R²level3, that the user specifies.
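Under the same assumptions as the sketch in Chapter 8, the covariate case can be approximated by shrinking the level-3 variance share before forming the noncentrality parameter and dropping one denominator degree of freedom for the covariate (a sketch of equation 11 in standardized, unconditional form; the function name and parameterization are illustrative, not part of OD):

```python
from scipy import stats

def power_treatment_l3_cov(n, J, K, delta, rho2, rho3, r2_l3, alpha=0.05):
    """Sketch: power for the main effect of treatment with a site-level
    covariate. The level-3 variance share rho3 is multiplied by
    (1 - r2_l3) and the test uses F(1, K-3)."""
    lam = K * delta**2 / (
        4 * (rho3 * (1 - r2_l3) + (rho2 + (1 - rho2 - rho3) / n) / J))
    f_crit = stats.f.ppf(1 - alpha, 1, K - 3)
    return 1 - stats.ncf.cdf(f_crit, 1, K - 3, lam)

# CSR example: the covariate explains 64% of the level-3 variance
print(power_treatment_l3_cov(25, 8, 40, 0.25, 0.20, 0.10, 0.64))
```

The result should be close to the 0.86 reported for the CSR example above.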

9. Using the Optimal Design Software for the Three Level Model with Treatment at

Level Three

This chapter focuses on how to use the OD software to design a three level trial with treatment at level three. Section 9.1 provides general information on the three level model with treatment at level three. Section 9.2 provides an example and details regarding the options within the three level design module.

9.1 General Information

The three level trial with treatment at level three option allows the researcher to

calculate the power for the average treatment effect as a function of the cluster size, the

number of clusters per site, the number of sites, the effect size, the level-2 variability, the

level-3 variability, and the proportion of variance explained by the site level covariate.

The menu is below:

Three Level Model with Treatment at Level 3

Power vs. cluster size (n)

Power vs. number of clusters per site (J)

Power vs. number of sites (K)

Power vs. effect size (delta)

Power vs. proportion of variance reduction at level 3 (R2)

9.2.1 Example

A team of researchers is designing a study to determine if a particular whole

school reform model improves academic achievement. The design consists of students

nested within classrooms nested within schools, which naturally lends itself to a three

level model. The reform effort is implemented at the level of the school, so the most

appropriate design is a three level design with treatment at level three. The researchers

plan to test students who are in the schools that participate in the new school reform

model (experimental group) and students who are in the schools that participate in the

regular model (control group) to determine if students in the new school reform model

score higher on an academic assessment. The researchers are unsure how to proceed with respect to the number of students they should test in each classroom, the number of classrooms in each school, and the number of schools needed to conduct a trial with power = 0.80. Five scenarios the researchers might encounter are presented below. Assume α = 0.05 for each case.

9.2.2 Scenario 1

Based on past studies, the researchers estimate that 10% of the variability is at the classroom level (level 2) and 10% of the variability is at the school level (level 3), leaving 80% of the variability between students. A minimum standardized effect size of 0.25 is desired. Assuming 40 schools are willing to participate, as well as 8 classrooms within each school, how many students within each classroom are necessary to achieve power = 0.80? What if the researchers include a school-level covariate that explains 49% of the variation in the school-level mean? How many students per classroom are required to achieve power = 0.80?

In Scenario 1, the cluster size is unknown thus we select the power vs. cluster size

(n) option. Figure 1 displays the screen.

Figure 1. 3 Level Model Screen.

The buttons in the toolbar are explained below.


α - specifies the significance level, or chance of a Type I error. By default, α is set at

0.05, which is a common level for most designs.

K – specifies the number of sites. By default, K is set at 30.

J – specifies the number of clusters within each site. By default, J is set at 6.

δ - specifies the minimum effect size of interest. By default, the minimum effect size is

set at 0.20.

Set – specifies the proportion of variability at level 2 and level 3. The total variability for levels 1, 2, and 3 is 1. After entering the proportion of variability at levels 2 and 3, click on compute sigma. The default settings are τπ (level-2 variability) = 0.10 and τβ (level-3 variability) = 0.10, so σ² (level-1 variability) = 0.80.

R²L3 – specifies the proportion of variance in the level-3 mean outcome explained by the level-3 covariate. By default, R²L3 is set to 0.

The remaining options in the toolbar are the same as those in the other modules. The

details can be found in Chapter 3.

Follow the steps below to answer the questions:

Step 1: Click on power vs. cluster size (n).

Step 2: Click on J on the toolbar and set J = 8. A graph will appear but by looking at the

key we can see it does not match the specific settings for this example.

Step 3: Click on K and set K = 40.

Step 4: Click on δ and set δ = 0.25.

Step 5: Click on R²_L3. Leave R²_L3(1) equal to 0. Set R²_L3(2) equal to 0.49. Figure 2

displays the screen.

Figure 2. Three Level Model – Power vs. cluster size.


Note that two trajectories appear on the screen. Without the covariate, we cannot achieve

power = 0.80. However, clicking along the dotted trajectory shows that including the

covariate allows us to achieve power = 0.80 with only 9 students per classroom. The

level-3 covariate increases the power and is usually inexpensive to collect, since it is a

site-level characteristic which in the case of schools can usually be found in a central

database.

9.2.3 Scenario 2

Based on past studies, the researchers estimate that 10% of the variability is at the

classroom level (level 2) and 10% of the variability is at the school level (level 3) leaving

80% of the variability between students. A minimum standardized effect size of 0.25 is

desired. Assuming 40 schools are willing to participate and there is an average of 30

students per classroom, how many classrooms are necessary to achieve power = 0.80?

What if the researchers include a school-level covariate that explains 49% of the variation

in the school-level mean? How many classrooms are required to achieve power = 0.80?

In Scenario 2, the number of clusters is unknown thus we select the power vs.

number of clusters per site (J) option.


Follow the steps below to answer the questions:

Step 1: Click on power vs. number of clusters per site (J).

Step 2: Click on n on the toolbar and set n = 30. A graph will appear but by looking at the

key we can see it does not match the specific settings for this example.

Step 3: Click on K and set K = 40.

Step 4: Click on δ and set δ = 0.25.

Step 5: Click on R²_L3. Leave R²_L3(1) equal to 0. Set R²_L3(2) equal to 0.49. Figure 3

displays the screen.

Figure 3. Three Level Model – Power vs. number of clusters per site.

Note that like the cluster size, increasing the number of clusters per site does not result in

power = 1.0. It is clear that without a covariate, we cannot achieve power = 0.80.

Including the covariate results in power = 0.80 for 5 clusters per site. Let’s see what

happens when we allow the number of sites to vary.

9.2.4 Scenario 3

Based on past studies, the researchers estimate that 10% of the variability is at the

classroom level (level 2) and 10% of the variability is at the school level (level 3) leaving

80% of the variability between students. A minimum standardized effect size of 0.25 is


desired. Assuming there is an average of 8 classrooms within each school that are willing

to participate and 30 students per classroom, how many schools are necessary to achieve

power = 0.80? What if the researchers include a school-level covariate that explains 49%

of the variation in the school level mean. How many schools are required to achieve

power = 0.80?

In Scenario 3, the number of sites is unknown thus we select the power vs.

number of sites (K) option.

Follow the steps below to answer the questions:

Step 1: Click on power vs. number of sites (K).

Step 2: Click on n on the toolbar and set n = 30. A graph will appear but by looking at the

key we can see it does not match the specific settings for this example.

Step 3: Click on J and set J = 8.

Step 4: Click on δ and set δ = 0.25.

Step 5: Click on R²_L3. Leave R²_L3(1) equal to 0. Set R²_L3(2) equal to 0.49. Figure 4

displays the screen.

Figure 4. Three Level Model – Power vs. number of sites.


Note that as the number of sites increases, the power goes to 1. Clicking along the

trajectory, we can see that 60 sites achieve power = 0.80 in the case of no covariate.

Including the covariate reduces the number of necessary sites to 36, which is much more

reasonable.
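The numbers in Scenario 3 can be checked outside the software with a short sketch. The noncentrality formula below is not derived in this chapter; it is the standard result for a three-level cluster randomized trial with treatment at level three, and the helper name `power_3level` and the degrees-of-freedom adjustment for the covariate are our own assumptions, so treat this as an illustration rather than the OD implementation:

```python
from scipy import stats

def power_3level(K, J, n, delta, tau_beta, tau_pi, sigma2, R2_L3=0.0, alpha=0.05):
    """Approximate power for the treatment main effect in a 3-level
    cluster randomized trial with treatment at level 3 (illustrative sketch)."""
    # Variance of the estimated treatment effect; the level-3 covariate
    # reduces the between-school variance component by a factor (1 - R^2_L3).
    var = 4 * (tau_beta * (1 - R2_L3) + tau_pi / J + sigma2 / (J * n)) / K
    lam = delta**2 / var                      # noncentrality parameter
    df2 = K - 2 - (1 if R2_L3 > 0 else 0)     # assume one df lost to the covariate
    f_crit = stats.f.ppf(1 - alpha, 1, df2)   # central F critical value
    return stats.ncf.sf(f_crit, 1, df2, lam)  # P(noncentral F > critical value)

# Scenario 3: ~60 schools without the covariate, ~36 with it.
print(power_3level(60, 8, 30, 0.25, 0.10, 0.10, 0.80))              # ≈ 0.80
print(power_3level(36, 8, 30, 0.25, 0.10, 0.10, 0.80, R2_L3=0.49))  # ≈ 0.80
```

The same function with K = 40 and varying n or J gives curves consistent with Scenarios 1 and 2.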

9.2.5 Scenario 4

Based on past studies, the researchers estimate that 10% of the variability is at the

classroom level (level 2) and 10% of the variability is at the school level (level 3) leaving

80% of the variability between students. A minimum standardized effect size of 0.25 is

desired. Assume 40 schools are willing to participate, 8 classrooms within each school,

and 30 students within each class. What is the minimum detectable standardized effect

for power = 0.80? What if the researchers include a school-level covariate that explains

49% of the variation in the school-level mean? What is the minimum detectable effect that

achieves power = 0.80?

In Scenario 4, the minimum detectable effect is unknown thus we select the power

vs. effect size option.

Follow the steps below to answer the questions:

Step 1: Click on power vs. effect size (delta).

Step 2: Click on n on the toolbar and set n = 30. A graph will appear but by looking at the

key we can see it does not match the specific settings for this example.

Step 3: Click on J and set J = 8.

Step 4: Click on K and set K = 40.

Step 5: Click on R²_L3. Leave R²_L3(1) equal to 0. Set R²_L3(2) equal to 0.49. Figure 5

displays the screen.


Figure 5. Three Level Model – Power vs. effect size.

Clicking along the trajectory reveals a minimum detectable effect of 0.30 in the case of

no covariate and 0.24 in the case of the covariate for power = 0.80.

9.2.6 Scenario 5

Based on past studies, the researchers estimate that 10% of the variability is at the

classroom level (level 2) and 10% of the variability is at the school level (level 3) leaving

80% of the variability between students. A minimum standardized effect size of 0.25 is

desired. Assume 40 schools are willing to participate, 8 classrooms within each school,

and 30 students within each classroom. What proportion of the variability in the level 3

outcome does the level 3 covariate need to explain in order to achieve power = 0.80?

In Scenario 5, the proportion of variance reduction as a result of the level-3

covariate is unknown thus we select the power vs. proportion of variance reduction at

level 3 (R²) option.

Follow the steps below to answer the questions:

Step 1: Click on power vs. proportion of variance reduction at level 3 (R²).

Step 2: Click on n on the toolbar and set n = 30. A graph will appear but by looking at the

key we can see it does not match the specific settings for this example.


Step 3: Click on J and set J = 8.

Step 4: Click on K and set K = 40.

Step 5: Click on δ and set δ = 0.25. Figure 6 displays the screen.

Figure 6. Three Level Model – Power vs. proportion of variance reduction at level 3.

Clicking along the trajectory reveals power = 0.80 can be achieved if 43% of the

variation in the level 3 mean outcome is explained by the level 3 covariate. Note that as

the proportion of explained variation increases towards 1, the power also increases

towards 1.


10. Repeated Measures in Cluster Randomized Trials

Chapters 10 and 11 explore longitudinal research designs. In a longitudinal study,

or repeated measures design, people are followed over time and observed on several

occasions. Chapter 10 explores the conceptual framework surrounding repeated measures

in cluster randomized trials and Chapter 11 describes how to use the Optimal Design

software to design a cluster randomized trial with repeated measures.

10.1 Why repeated measures?

In a typical longitudinal study, observations are recorded prior to treatment, often

referred to as the baseline measurement, and then after the treatment a pre-determined

number of times. Measuring participants prior to treatment and post-treatment allows the

researchers to assess individual growth. Individual growth may be plotted via a straight

line or a curvilinear trajectory. A linear trajectory, or first degree polynomial, is

characterized by an intercept and a linear rate of change, or slope. Curvilinear trajectories

are second, third, or higher degree polynomials. A second degree polynomial, also known

as a quadratic polynomial, adds an acceleration parameter to the intercept and rate of

change. A third degree polynomial, also known as a cubic polynomial, is characterized by

4 parameters: change in acceleration, acceleration, linear rate of change, and an

intercept.

In a simple repeated measures design, individuals are repeatedly observed and

individual trajectories are plotted to assess average treatment effects on a specific

polynomial change parameter. In this chapter we extend the simple design to settings in

which individuals are nested within clusters and treatment is applied at the cluster level.

This allows us to assess the average difference in the polynomial change parameter for

those in the treatment group and those in the control group, accounting for the cluster

effect.

To illustrate, imagine that a group of researchers develop a new phonics program

for first graders. The program is an intense year-long program. Students are assessed at

the beginning of the year, prior to treatment, and five times throughout the year. 40

classrooms have been randomly selected to participate in the study, 20 in the treatment

group and 20 in the control group. Each classroom has 25 students and all 25 will

participate in the study. Since we have repeated measures on students who are nested


within classrooms, we must treat the design as a cluster randomized trial with repeated

measures in order to determine the power of the test correctly. If we ignore the clusters,

the estimate of the variance of the treatment effect and the power calculations will not be

correct.

The power to detect the main effect of treatment in a repeated-measures cluster

randomized trial is more complicated than in a cluster randomized trial because we need

to take the repeated measures on each person into consideration. However, in this chapter

we try to keep things simple by focusing only on orthogonal designs with continuous

outcomes, a random-effects covariance structure, homogeneous covariance structures

within treatments, and complete data. In these designs, power is a function of the

frequency of observations, f, the duration of the study, D, the total number of

observations, M, the number of participants within each cluster, n, the number of clusters,

J, the effect size, δ, the intra-class correlation, ρ, the within-person variance, σ², the

between-person variance, τ_πp, on the polynomial change parameter of order p, and the

associated between-cluster variance, τ_βp. The data lend themselves to the three-level

hierarchical model described in the next section.

10.2. The Model

Data from a cluster randomized trial with repeated measures on the individuals

can be represented with a three-level model, with occasions nested within persons and

persons nested within clusters. The general level-1 model, or repeated measures model,

represents the trajectory of change for person i as a polynomial function of degree P − 1

defined at equally spaced observations. The model is:

$$Y_{mij} = \sum_{p=0}^{P-1} \pi_{pij} c_{pm} + e_{mij}, \qquad e_{mij} \sim N(0, \sigma^2), \qquad (1)$$

for observations m ∈ {1, 2, ..., M}, persons i ∈ {1, 2, ..., n}, and clusters j ∈ {1, 2, ..., J},

where p is the polynomial order of change (e.g., linear, quadratic, or cubic);

π_pij is the level-1 coefficient for the polynomial of order p;

c_pm is the orthogonal polynomial contrast coefficient;

e_mij is the error associated with the repeated measures; and

σ² is the within-person variance.

Note the orthogonal polynomial contrast coefficients are necessary to center the data.

These coefficients are given by (see, e.g., Kirk 1982; Raudenbush and Liu 2001):

$$c_{0m} = 1, \qquad (2)$$

$$c_{1m} = m - \sum_{m=1}^{M} m / M,$$

$$c_{2m} = \frac{1}{2}\left(c_{1m}^{2} - \sum_{m=1}^{M} c_{1m}^{2} / M\right), \text{ and}$$

$$c_{3m} = \frac{1}{6}\left(c_{1m}^{3} - c_{1m} \sum_{m=1}^{M} c_{1m}^{4} \Big/ \sum_{m=1}^{M} c_{1m}^{2}\right).$$
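As a quick numerical check on equation 2, the coefficients can be computed directly. This sketch (the helper name `contrasts` is ours, not part of the OD software) also shows that the resulting contrasts come out mutually orthogonal:

```python
import numpy as np

def contrasts(M):
    """Orthogonal polynomial contrast coefficients of equation 2 for M occasions."""
    m = np.arange(1, M + 1, dtype=float)
    c0 = np.ones(M)
    c1 = m - m.sum() / M
    c2 = 0.5 * (c1**2 - (c1**2).sum() / M)
    c3 = (c1**3 - c1 * (c1**4).sum() / (c1**2).sum()) / 6.0
    return c0, c1, c2, c3

c0, c1, c2, c3 = contrasts(5)
print(c1)                 # the linear contrasts of equation 6: [-2. -1.  0.  1.  2.]
print(c1 @ c2, c1 @ c3)   # orthogonality: both ≈ 0
```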

The level-2 model, or person-level model, is:

$$\pi_{pij} = \beta_{p0j} + r_{pij}, \qquad r_{pij} \sim N(0, \tau_{\pi p}), \qquad (3)$$

where β_p0j is the cluster mean for the pth polynomial change parameter;

r_pij is the random effect associated with the persons; and

τ_πp is the between-person variance for the pth polynomial change parameter.

The level-3 model, or cluster-level model, is:

$$\beta_{p0j} = \gamma_{p00} + \gamma_{p01} W_j + u_{p0j}, \qquad u_{p0j} \sim N(0, \tau_{\beta p}), \qquad (4)$$

where γ_p00 is the grand mean for the polynomial order of change;

γ_p01 is the main effect of treatment;

W_j is a treatment contrast indicator, ½ for treatment and -½ for control;

u_p0j is the random effect associated with each cluster; and

τ_βp is the between-cluster variance for the polynomial order of change.


To help clarify the general model, consider a first degree polynomial order of

change, or linear model (p = 1). The level-1 model is:

$$Y_{mij} = \pi_{0ij} + \pi_{1ij} c_{1m} + e_{mij}, \qquad e_{mij} \sim N(0, \sigma^2), \qquad (5)$$

for occasions m ∈ {1, 2, ..., M}, persons i ∈ {1, 2, ..., n}, and clusters j ∈ {1, 2, ..., J},

where π_0ij is the mean response for person i in cluster j;

π_1ij is the average rate of change for person i in cluster j;

c_1m is the orthogonal linear contrast coefficient;

e_mij is the error associated with the repeated measures; and

σ² is the within-person variance.

Note that in the case of the linear model, the contrast coefficients are easily computed

using the formulas in equation 2. For example, if M = 5, the orthogonal contrast

coefficients for a first degree polynomial are:

$$c_{0} = (1, 1, 1, 1, 1)$$
$$c_{1} = (-2, -1, 0, 1, 2) \qquad (6)$$

The level-2 model, or person-level model, is:

$$\pi_{0ij} = \beta_{00j} + r_{0ij}, \qquad r_{0ij} \sim N(0, \tau_{\pi 0}), \qquad (7)$$
$$\pi_{1ij} = \beta_{10j} + r_{1ij}, \qquad r_{1ij} \sim N(0, \tau_{\pi 1}),$$

where β_00j is the mean response in cluster j;

β_10j is the average growth rate in cluster j;

r_0ij is the random effect associated with the mean response for person i in cluster j;

r_1ij is the random effect associated with the growth rate for person i in cluster j;

τ_π0 is the between-person variance in means; and

τ_π1 is the between-person variance in growth rates.

The level-3 model, or cluster-level model, is:

$$\beta_{00j} = \gamma_{000} + \gamma_{001} W_j + u_{00j}, \qquad u_{00j} \sim N(0, \tau_{\beta 0}), \qquad (8)$$
$$\beta_{10j} = \gamma_{100} + \gamma_{101} W_j + u_{10j}, \qquad u_{10j} \sim N(0, \tau_{\beta 1}),$$

where γ_000 is the grand mean;

γ_001 is the main effect of treatment for the mean;

W_j is the treatment indicator, ½ for treatment and -½ for control;

γ_100 is the average growth rate;

γ_101 is the main effect of treatment for the growth rates;

u_00j is the random effect associated with the mean for each cluster;

u_10j is the random effect associated with the growth rate for each cluster;

τ_β0 is the between-cluster variance in means; and

τ_β1 is the between-cluster variance in growth rates.

Note that for a first degree polynomial, our primary interest is in growth rates, thus we

are interested in γ_101, the main effect of treatment on the growth rates, and in τ_β1, the

between-cluster variance in growth rates.

10.3 Testing the Main Effect of Treatment

The average treatment effect for the pth polynomial order of change in our

balanced design is defined in level 3 of the model. It is estimated by:

$$\hat{\gamma}_{p01} = \bar{Y}_{E} - \bar{Y}_{C} \qquad (9)$$

Note that the estimated main effect of treatment looks like that in the cluster randomized

trial except that now we are averaging over occasions and persons. The variance of the

treatment effect for the pth polynomial order of change (Raudenbush and Liu 2001) is:

$$\operatorname{Var}(\hat{\gamma}_{p01}) = \frac{4[\tau_{\beta p} + (\tau_{\pi p} + V_p)/n]}{J}, \qquad (10)$$

$$V_p = \frac{\sigma^2}{f^{2p} \sum_{m=1}^{M} c_{pm}^{2}} = \frac{\sigma^2 (M - p - 1)!}{f^{2p} K_p (M + p)!},$$

where f is the frequency of observation;

D is the duration of the study;

M is the total number of occasions, M = Df + 1;

p is the polynomial order of change; and

K_p is a constant, where K_1 = 1/12, K_2 = 1/720, and K_3 = 1/100,800.

Note that V_p denotes the conditional variance of the least squares estimate of each

participant's change parameter.

We can translate the above formulas to a more concrete example in the case of a

first degree polynomial. For a first degree polynomial, the variance of the estimate of the

treatment effect is:

$$\operatorname{Var}(\hat{\gamma}_{101}) = \frac{4[\tau_{\beta 1} + (\tau_{\pi 1} + V_1)/n]}{J}, \qquad (11)$$

where

$$V_1 = \frac{\sigma^2}{f^{2} \sum_{m=1}^{M} c_{1m}^{2}} = \frac{12\,\sigma^2 (M - 2)!}{f^{2} (M + 1)!}. \qquad (12)$$

In the general case, we can use the following hypotheses to test the significance of

the main effect of treatment for the polynomial order of interest:

$$H_0: \gamma_{p01} = 0$$
$$H_1: \gamma_{p01} \neq 0 \qquad (13)$$

When the null hypothesis is true, the test statistic is an F statistic and follows a central F

distribution, F(1, J − 2). The test statistic is:

$$F = \frac{\hat{\gamma}_{p01}^{2}}{\operatorname{Var}(\hat{\gamma}_{p01})}. \qquad (14)$$

When the alternative hypothesis is true, the test statistic remains the same but

follows a noncentral F distribution, F(1, J − 2; λ). Recall that the noncentrality parameter

is a ratio of the squared treatment effect to the variance of the treatment effect estimate.

The noncentrality parameter is:

$$\lambda = \frac{\gamma_{p01}^{2}}{\operatorname{Var}(\hat{\gamma}_{p01})} = \frac{J\,\gamma_{p01}^{2}}{4[\tau_{\beta p} + (\tau_{\pi p} + V_p)/n]} \qquad (15)$$


Recall that the larger the noncentrality parameter, the greater the power of the test.

Looking at the formula, we can see that J is the most influential sample size for

increasing the power. In other words, the number of clusters is more important than the

number of people within each cluster for increasing the power. It is particularly important

to have a large number of clusters if there is a lot of between-cluster variation, τ_βp. Also,

increasing the number of occasions, M, reduces the within-person variance, which

increases the power. Note that M is a function of f and D, where M = fD + 1, so increasing

the frequency of the observations or the duration of the study increases M. Increasing n,

the number of people within each cluster, will also decrease the total within- and

between-person variance, thus increasing the power. Finally, larger effect sizes increase

the power to detect a treatment effect.

Thus far, we have concentrated on the unstandardized model. However, similar to

cluster randomized trials, researchers typically use standardized models and effect sizes.

As discussed in Chapter 1, we will use Cohen’s rules of thumb for standardized effect

sizes, with 0.20, 0.50, and 0.80 as small, medium, and large effect sizes. Let’s see how

we translate the model to standardized notation.

The standardized effect size for a polynomial of order p is:

pp

p

πβ ττ

γδ

+= 01 (16)

where 01pγ is the main effect for the polynomial order of change, and

pp πβ ττ + is the total between-cluster and between-person variance, denoted τ .

In words, δ is the group difference on the polynomial of interest divided by the standard

deviation for that polynomial, or the square root of the sum of the between-cluster

variance and the between-person variance for the specified polynomial. Similar to

standardized models we defined in previous chapters, we need to define ρ, the intra-class

correlation. The intra-class correlation, ρ, is:

$$\rho = \frac{\tau_{\beta p}}{\tau_{\beta p} + \tau_{\pi p}} \qquad (17)$$

where τ = τ_βp + τ_πp is the total between-cluster and within-cluster variance;

τ_βp is the between-cluster variance on the polynomial of interest; and

τ_πp is the within-cluster variance on the polynomial of interest.

Note that if τ = 1, then τ_βp = ρ and τ_πp = 1 − ρ, which is consistent with the intra-class

correlation for a cluster randomized trial. Also, ρ is a ratio of the between-cluster

variance to the total variance for a specific polynomial order of change. We can think of

ρ as partitioning the growth-rate variance into a between-cluster and within-cluster

component.

Using the standardized effect size δ, ρ, and constraining τ = 1, we can rewrite

the variance of the treatment effect estimate as

$$\operatorname{Var}(\hat{\gamma}_{p01}) = \frac{4[\rho + (1 - \rho + V_p)/n]}{J}. \qquad (18)$$

Another simplification involves rewriting the variance in terms of the reliability of the

person-specific polynomial change. The reliability is denoted α_p and is defined as:

$$\alpha_p = \frac{\tau_{\pi p}}{\tau_{\pi p} + V_p}. \qquad (19)$$

Rewriting the variance in terms of the reliability we get:

$$\operatorname{Var}(\hat{\gamma}_{p01}) = \frac{4[\rho + (1 - \rho)/(\alpha_p n)]}{J}. \qquad (20)$$

We write the variance in this form because standard programs for hierarchical data often

give us an estimate of the person-specific reliability.

We can also rewrite the noncentrality parameter in terms of the standardized

notation. The new noncentrality parameter is:

$$\lambda = \frac{J\,\delta^{2}}{4[\rho + (1 - \rho)/(\alpha_p n)]} \qquad (21)$$

Note that the power is now a function of the number of clusters, J, the cluster size,

n, the standardized effect size, δ, the intra-class correlation, ρ, and the reliability, α_p,

which is a function of the between-person variance, τ_πp, the within-person variance, σ²,

the study duration, D, the frequency of the observations, f, and the number of occasions,

M. It is important to be familiar with the standardized notation because the Optimal

Design software operates with the standardized notation.

10.4 Examples

To illustrate the use of repeated measures in a cluster randomized trial, consider

the example below, which is a modification of the example introduced at the beginning of

the chapter. Imagine that researchers develop a new phonics program for first graders.

The program is a five-year program. Students are assessed one time each year.

Researchers are interested in the growth rate of students so they propose a linear model.

40 classrooms have been randomly selected to participate in the study, 20 in the treatment

group and 20 in the control group. Each classroom has 25 students and all 25 will

participate in the study. Based on a past 2-level repeated measures design, the researchers

estimate the within-person variance to be 1.0 and the overall variability in the growth

rates = 1.0. They hypothesize ρ = 0.10 but would like to allow ρ to vary along the x-axis.

They want to detect a minimum effect size of 0.30. What is the power of the test to

detect the main effect of treatment for ρ = 0.10?

First let's list the information that is given.

J = 40
n = 25
ρ = 0.10
δ = 0.30
M = 6
f = 1
D = 5
σ² = 1.0
τ = 1.0 (Note that this is τ_βp + τ_πp. We estimate how it is partitioned by ρ.)

Entering this information into the cluster randomized trials with repeated

measures option produces the graph in Figure 1.


Figure 1. Power vs. Intra-Class Correlation. [Plot: power on the y-axis, intra-class

correlation on the x-axis; key: α = 0.050, f = 1, D = 5, M = 6, σ² = 1.0, τ = 1.0,

δ = 0.30, J = 40, n = 25.]

Note that the power is plotted as a function of ρ, which varies along the x-axis. The key in

the upper right hand corner can be used to confirm the settings. Clicking along the

trajectory at ρ = 0.10 reveals that the power = 0.70. We can see that for smaller values of

ρ, the power is higher.
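The reported power of 0.70 can be reproduced outside the software from equations 11, 12, and 15 (equivalently, 18 and 21). This is an illustrative re-implementation under the chapter's assumptions (balanced design, linear change, complete data); the function `crt_rm_power` is our own sketch, not the OD source code:

```python
import numpy as np
from scipy import stats

def crt_rm_power(J, n, rho, delta, M, f=1, sigma2=1.0, tau=1.0, alpha=0.05):
    """Power for the main effect of treatment on linear growth in a
    cluster randomized trial with repeated measures (equations 11-15)."""
    m = np.arange(1, M + 1, dtype=float)
    c1 = m - m.mean()                              # orthogonal linear contrasts
    V1 = sigma2 / (f**2 * (c1**2).sum())           # equation 12
    tau_beta, tau_pi = rho * tau, (1 - rho) * tau  # partition tau by rho
    var = 4 * (tau_beta + (tau_pi + V1) / n) / J   # equation 11
    lam = delta**2 * tau / var                     # noncentrality, equation 15
    f_crit = stats.f.ppf(1 - alpha, 1, J - 2)
    return stats.ncf.sf(f_crit, 1, J - 2, lam)

print(crt_rm_power(J=40, n=25, rho=0.10, delta=0.30, M=6))  # ≈ 0.70
```

Dropping ρ to 0.05 raises the result, matching the observation that the curve is higher for smaller intra-class correlations.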

Another scenario a researcher might encounter in designing a cluster randomized

trial with repeated measures is described below. Suppose a researcher conducted a 3-level

pilot study about the phonics program described in the previous example. From the pilot

study, he estimates the within-person variation, σ² = 1.0, the between-person variation

on growth rates, τ_π1 = 0.90, and the between-cluster variation on growth rates, τ_β1 = 0.10.

The students are tested one time in each of five years and 40 schools are selected to

participate with 25 students in each school. What is the power to detect an effect size of

0.30?

First, we need to translate the information given into a form that is acceptable in

the Optimal Design program. We know σ² = 1.0, τ_π1 = 0.90, and τ_β1 = 0.10. Thus we

can calculate:

$$\rho = \frac{0.10}{0.90 + 0.10} = 0.10 \quad \text{and} \quad \tau = 0.10 + 0.90 = 1.0.$$

Note that all of the parameters are the same as in example 1 so we will get the same

results. The idea here is that although the information may appear different in its original

form, it is important to translate it into the correct parameters required by the program

before continuing further.
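The translation is a two-line calculation; sketched here with the pilot-study numbers (the variable names are ours):

```python
tau_pi1 = 0.90    # between-person variance in growth rates (pilot estimate)
tau_beta1 = 0.10  # between-cluster variance in growth rates (pilot estimate)

tau = tau_beta1 + tau_pi1   # total level-1 coefficient variability expected by OD
rho = tau_beta1 / tau       # intra-class correlation

print(rho, tau)  # 0.1 1.0
```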


11. Using the Optimal Design Software for Cluster Randomized Trials with Repeated

Measures

This chapter focuses on how to use the OD software to design a cluster

randomized trial with repeated measures. Section 11.1 provides general information

about how to use the cluster randomized trial with repeated measures option. Section 11.2

provides an example and details for using the software.

11.1 General Information

The cluster randomized trial with repeated measures option allows the researcher

to explore the power for the main effect of treatment as a function of the cluster size, n,

the number of clusters, J, the intra-class correlation, ρ , and the desired effect size, δ .

Below is the menu.

Cluster Randomized Trials Repeated Measures

Power vs. Cluster Size (n)

Power vs. Number of Clusters (J)

Power vs. Intra-class Correlation ( ρ )

Power vs. Effect Size (δ )

11.2.1 Example

For illustration purposes, we modify the example introduced in Chapter 10.

Imagine that a group of researchers develop a new phonics program for first graders. The

program is an intense year-long program. The researchers propose a repeated measures

design for students nested within schools. They plan to assess students at the beginning of

the year, prior to treatment, and then on six occasions throughout the year. Researchers

are interested in the growth rate of students so they propose a linear model. A past

two-level repeated measures design estimates the within-person variability to be 10.0 and

the overall variability in growth rates to be 1.0. They want to explore different designs to try

to achieve high power. Four scenarios the researcher might encounter are presented

below. Assume alpha = 0.05 for each case.


11.2.2 Scenario 1

The researchers hypothesize ρ = 0.05. In other words, 5% of the total variation in

growth rates is between-cluster variation. They want to detect a minimum effect size of

0.25. Assuming 40 schools are willing to participate in the study, how many students do

they need in each school to achieve power = 0.80?

In Scenario 1, the cluster size is unknown thus we select the power vs. cluster size

(n) option. Figure 1 displays the screen.

Figure 1. CRTRM - Power vs. Cluster Size (n).

Let’s take a closer look at the function of each of the buttons on the toolbar.

α - specifies the significance level, or chance of a Type I error. By default, α is set at

0.05, which is a common level for most designs.

J – specifies the number of clusters. By default, J is set at 20.

ρ - specifies the intra-class correlation. Recall it is defined as ρ = τ_βp / (τ_βp + τ_πp).

By default, ρ is set at 0.05 and 0.10.

δ - specifies the effect size. By default, δ is set at 0.40.

Set – Within the Set button, there are a variety of settings listed below.


F – specifies the frequency of observations.

D – specifies the duration of the study.

M – specifies the total number of observed occasions and is a function of F and D

where M=(FD+1).

Variability of level-1 residual, σ² - specifies the within-person variation.

Variability of level-1 coefficient, τ - specifies the sum of the between-person and

between-cluster variation, τ_πp + τ_βp.

Polynomial Order – allows the researcher to select either a linear, quadratic, or

cubic model.

The remaining options on the toolbar are the same as those in the Cluster randomized

trial option, which are explained in full detail in Chapter 2.

Now let’s use the software to explore the question in Scenario 1. To answer the

question, click on Power vs. cluster size (n). Then move along the toolbar and specify

J = 40, ρ = 0.05, and δ = 0.25. Click on the Set button and change D=6. Note this makes

M=7 because M=(DF+1). Set σ² to 10.0 and note that τ is set to 1.0, which matches our

design. The linear model selection is already checked which also matches the design.

Figure 2 displays the results.


Figure 2. CRTRM - Power vs. Cluster Size (n).

Note that the key in the upper right corner reflects the settings we identified. Power is

plotted as a function of the cluster size, which varies along the x-axis. Clicking along the

trajectory reveals that 50 students per school are required to achieve power = 0.80. Note

that the power does not converge to 1 as the sample size per cluster is increased.

11.2.3 Scenario 2

The researchers hypothesize ρ = 0.05. Again, 5% of the total variation in growth

rates is between-cluster variation. They want to detect a minimum effect size of 0.25.

Assuming 25 students are willing to participate in each school, how many schools do

they need to achieve power = 0.80?

In Scenario 2, the number of clusters is unknown thus we select the power vs. number

of clusters (J) option. This allows the number of clusters to vary along the x-axis. As a

result, J will be replaced on the toolbar by the cluster size (n) icon. The remaining options

function as previously described. Now let’s use the software to explore the question in

Scenario 2.

To answer the question, click on Power vs. number of clusters (J). Then move

along the toolbar and specify n = 25, ρ = 0.05, and δ = 0.25. Click on the Set button and

change D=6. Note this makes M=7 because M=(DF+1). Set σ² to 10.0 and note τ is set to

1.0 which matches our design. The linear model selection is already checked which also

matches the design. Figure 3 displays the results.

Figure 3. CRTRM - Power vs. Number of Clusters (J).

Clicking along the trajectory reveals that 54 clusters are necessary to achieve power =

0.80. Note that unlike the cluster size, as the number of clusters increases, the power

tends towards 1.0. This corresponds to the information presented in Chapter 7, which

states that the number of clusters is more influential on power than cluster size.

11.2.4 Scenario 3

The researchers want to detect a minimum effect size of 0.25. Assume they are

able to secure 40 schools and 25 students within each school. What value of ρ achieves

power = 0.80? What does this value of ρ mean?

In Scenario 3, the intra-class correlation, ρ , is unknown thus we select the power

vs. intra-class correlation option. Again, the only change in the toolbar is that the ρ no

longer appears since it is allowed to vary along the x-axis. This is a very useful option for

the case where the overall variability in the polynomial order of change is estimated from


a 2-level repeated measures design, but researchers are unclear about how the variability

is partitioned into between-person and between-cluster variability. This option allows the

researchers to calculate the power for different values of ρ . Let’s explore Scenario 3 to

get a better idea of this option.

To answer the question, click on Power vs. intra-class correlation ( ρ ). Then

move along the toolbar and specify J = 40, n = 25, and δ = 0.25. Click on the Set button and

change D=6. Note this makes M=7 because M=(DF+1). Set σ² to 10.0 and note τ is set to

1.0 which matches our design. The linear model selection is already checked which also

matches the design. Figure 4 displays the results.

Figure 4. CRTRM - Power vs. Intra-class Correlation (ρ).

Clicking along the trajectory reveals that ρ = 0.02 results in power = 0.80. This means

that only 2% of the overall variability in growth rates can be attributed to the between-

cluster variation. Note that as the intra-class correlation increases, or more of the

variability is between clusters, the power of the test decreases.


11.2.5 Scenario 4

The researchers hypothesize ρ = 0.05. Again, 5% of the total variation in growth

rates is between-cluster variation. Assume that they are able to secure 40 schools and 25

students within each school. What is the minimum effect size they can detect with power

= 0.80?

In Scenario 4, the effect size is unknown thus we select the power vs. effect size (δ)

option. Again, the only change in the toolbar is that δ no longer appears since it is

allowed to vary along the x-axis. Let’s explore Scenario 4 using the software.

To answer the question, click on Power vs. effect size (δ ). Then move along the

toolbar and specify J = 40, n = 25, and ρ = 0.05. Click on the Set button and change D=6.

Note this makes M=7 because M=(DF+1). Set σ² to 10.0 and note τ is set to 1.0 which

matches our design. The linear model selection is already checked which also matches

the design. Figure 5 displays the results.

Figure 5. CRTRM - Power vs. Effect Size (δ ).


Clicking along the trajectory reveals that δ = 0.29 results in power = 0.80. Note that as

the effect size increases, the power also increases, which is consistent with the

information in Chapter 7.


References

Kirk, Roger E. 1982. Experimental Design: Procedures for the Behavioral Sciences. 2nd ed. Belmont, CA: Brooks/Cole.

Raudenbush, Stephen W. 1997. Statistical Analysis and Optimal Design for Cluster Randomized Trials. Psychological Methods 2(2):173-185.

Raudenbush, Stephen W., and Anthony S. Bryk. 2002. Hierarchical Linear Models: Applications and Data Analysis Methods. 2nd ed. Thousand Oaks, CA: Sage Publications.

Raudenbush, Stephen W., and Xiaofeng Liu. 2000. Statistical Power and Optimal Design for Multisite Randomized Trials. Psychological Methods 5(2):199-213.

Raudenbush, Stephen W., and Xiaofeng Liu. 2001. Effects of Study Duration, Frequency of Observation, and Sample Size on Power in Studies of Group Differences in Polynomial Change. Psychological Methods 6(4):387-401.