Optimal Design for Longitudinal and Multilevel Research:
Documentation for the “Optimal Design” Software
Jessaca Spybrook
University of Michigan
Stephen. W. Raudenbush
University of Chicago
Xiao-feng Liu
University of South Carolina
Richard Congdon
Harvard University
Andrés Martínez
University of Michigan
Applies to Optimal Design Version 1.76
Last Revised on March 12, 2008
Preface
The Optimal Design software, developed with support from the National Institute
of Mental Health and the William T. Grant Foundation, now contains modules that can
assist researchers in planning single level trials, cluster randomized trials, multi-site
randomized trials, multi-site-cluster randomized trials, cluster randomized trials with
treatment at level three, trials with repeated measures, and cluster randomized trials with
repeated measures.
We regard this version of the software as a “beta version,” meaning that we
distribute it for use under the condition that those who use it are asked to promptly report
difficulties or errors to Andres Martinez ([email protected]), Stephen W. Raudenbush
([email protected]) and/or Jessaca Spybrook ([email protected]). We will
attempt to make needed changes quickly. This documentation will also be revised based
on reviewers’ comments.
Table of Contents
1. Cluster Randomized Trials ............................................................................................. 4
2. Including a Cluster Level Covariate in a Cluster Randomized Trial .............................. 25
3. Using the Optimal Design Software for Cluster Randomized Trials ........................... 32
4. Cluster Randomized Trials with Binary Outcomes ...................................................... 56
5. Using the Optimal Design Software for Cluster Randomized Trials with Binary
Outcomes .......................................................................................................................... 60
6. Multi-site Cluster Randomized Trials ........................................................................... 67
7. Using the Optimal Design Software for Multi-site Cluster Randomized Trials ........... 89
8. Three Level Models with Randomization at Level Three .......................................... 109
9. Using the Optimal Design Software for the Three Level Model with Treatment at
Level Three ..................................................................................................................... 118
10. Repeated Measures in Cluster Randomized Trials ................................................... 127
References ....................................................................................................................... 146
1. Cluster Randomized Trials
Cluster randomized trials have become a popular design choice in social science
research. These trials rely on the assignment of clusters to treatments. For example,
assume there are 40 schools in an experiment. In a cluster randomized trial, 20 schools
may be assigned to the experimental treatment, a new math series, and 20 schools may be
assigned to the control, the regular math series. Note that unlike typical designs,
individuals are not randomly assigned to treatment or control, but rather clusters or
groups of individuals are assigned. Readers who are familiar with hierarchical linear
models can think of this as a two level design, students nested within schools
(Raudenbush and Bryk 2002). Here, students are the level-one units and schools are the
level-two units. The treatment contrast is defined at level two.
The first three chapters in this manual provide researchers with a guide to
effectively designing a cluster randomized trial for a continuous outcome. The first
chapter provides an overview of key statistical terms and background information
relating to cluster randomized trials. The second chapter introduces the concept of a
cluster-level covariate to cluster randomized trials. Chapter 3 describes how to use the
Optimal Design software to design a cluster randomized trial with and without a cluster-
level covariate.
1.1 Components of a Cluster Randomized Trial
In a cluster randomized trial, our primary goals are to estimate the difference
between treatments and to determine if that difference is statistically significant. For
example, in the case of the new math series, we might want to determine if there is a
difference in mean math achievement between schools that implement the new series and
schools that use the regular series. Typically, math achievement is measured by a test, so
we might look to see if the students experiencing the new math series scored significantly
higher on average than the students experiencing the regular math series on an
appropriate test given to both groups. To determine if there is a significant difference
between the two group means, we must have adequate statistical power. In a completely
balanced cluster randomized trial, the power to detect a difference between the two
groups, or the main effect of treatment, depends on the cluster size (n), the number of
clusters (J), the intra-class correlation ( ρ ), and the effect size (δ ). The remainder of this
chapter examines each of the components of a cluster randomized trial and how they
affect the power of the study.
1.1.1 Statistical Power
Power is the probability of rejecting the null hypothesis when a specific
alternative hypothesis is true. In a study comparing two groups, power is the chance of
rejecting the null hypothesis that the two groups share a common population mean and
therefore claiming that there is a difference between the population means of the two
groups, when in fact there is a difference of a given magnitude. It is thus the chance of
making the correct decision, that the two groups are different from each other. Power is
linked to discussions of hypothesis testing and significance levels so it is important to
have a clear definition of each of these terms before proceeding. Note that in a perfectly
implemented randomized experiment with correctly analyzed data, power is the
probability of discovering a causal effect of treatment when such an effect truly exists.
In hypothesis testing, there are two hypotheses, a null hypothesis and an
alternative hypothesis. In a two-treatment design, the most common null hypothesis states
that there is no significant difference between the population mean for the treatment
group and the control group. The alternative hypothesis states that there is a difference
between groups. The difference may be expressed as a positive treatment effect, a
negative treatment effect, or simply that the treatment mean is not equal to the control
mean. For example, in the case of the new math series, the null hypothesis states that on
average, math achievement will be the same for students using the regular math series
(control group) and students using the new math series (experimental group). However,
the researchers believe that the new math series is better than the regular series. In this
case, the alternative hypothesis states that average math achievement for the experimental
group is higher than that of the control group. Thus the alternative hypothesis states that
there is a positive treatment effect. After the hypotheses are clearly stated and the data
has been collected and analyzed, the researcher must decide if there is sufficient evidence
to reject the null hypothesis.
The significance level, often denoted α , is the probability of rejecting the null
hypothesis when it is true. This is known as a Type I error rate. A Type I error occurs
when the researcher finds a significant difference between two groups that do not, in fact,
differ. In the math example, a Type I error would occur if we conclude that students using
the new math series scored higher, on average, than the control group when in fact there
is no difference between the two groups. Typically, alpha is set at 0.05 so that, when the
null hypothesis is true, there is only a 5% chance of making this type of mistake.
Suppose, however, that the null hypothesis is indeed false. A Type II error arises
when we mistakenly retain the null hypothesis. The probability of retaining a false null
hypothesis, often denoted β , is therefore the Type II error rate. In the math example, a
Type II error occurs if a researcher concludes that, on average, math achievement for the
two groups is the same when in fact students using the new math series achieve higher
than students using the regular math series. In this case, the researcher overlooks a
significant difference. The two types of errors are illustrated in Table 1.
Table 1: Possible errors in hypothesis testing.

                             Do Not Reject the Null Hypothesis     Reject the Null Hypothesis
Null Hypothesis is True      No Error (Probability = 1 − α)        Type I Error (Probability = α)
Null Hypothesis is False     Type II Error (Probability = β)       No Error (Probability = 1 − β)
If the null hypothesis is true (first row of Table 1), the correct decision is to retain
the null, and the probability of this correct decision = Probability(Retain H0 | H0 is true)
= 1 − α. With α = 0.05, for example, the probability is 0.95 that we will make the correct
decision of retaining H0 when it is true. The incorrect decision in this case is the Type I
error: rejecting the true H0. When H0 is true, this error will occur with probability
α = 0.05.
On the other hand, if the null hypothesis is false (second row of Table 1), the
correct decision is to reject it. The probability of making this correct decision is defined
as power = Probability(Reject H0 | H0 is false) = 1 − β. The incorrect decision, known as
the Type II error, occurs with probability β; that is, Prob(Type II error | H0 false) = β.
Looking at the results of a study retrospectively, we know that a researcher who
has retained H0 (column 1 of Table 1) has either made a correct decision or committed a
Type II error. In contrast, a researcher who has rejected H0 (column 2) has either made a
correct decision or committed a Type I error. Note that it is logically impossible for a
researcher who has rejected H0 to have made a Type II error. To criticize such a
researcher for designing a study with low power in this case would be a vacuous
criticism, since a lack of power cannot account for a decision to reject H0. However, a
researcher who retains the null hypothesis may have committed a Type II error and is
therefore potentially vulnerable to the criticism that the study lacked power. Indeed, low
power studies in which H0 is retained are virtually impossible to interpret. One cannot
claim a new treatment to be ineffective in a study having low power because, by
definition, such a low power study would have little chance of detecting a true difference
between the two populations represented in the study.
Although Type I and Type II errors are mutually exclusive, the choice of α can
affect power. Suppose a researcher, worried about committing a Type I error, sets a lower
α, say α = 0.001. If the null hypothesis is true, this researcher will indeed be protected
against a Type I error. However, suppose H0 is false. Setting α very low will reduce
power, which is equivalent to increasing β, the probability of a Type II error. While keeping in
mind that the choice of α affects power, we will for simplicity assume α = 0.05 in the
remainder of this discussion in order to focus on sample size as a key determinant of
power.
Of course, neither type of error is desirable and we would prefer to make the
correct decision. As a result, we want the probability of correctly detecting a difference,
that is the power, to be large. Think again about the math example. Assuming the new
math series works better than the control series, we want high power to detect a
difference between the group using the new math series and the group using the regular
math series. In other words, assuming the new curriculum is effective, we seek a high
probability of rejecting the null hypothesis and concluding that, on average, students
using the new math series have higher math achievement. For example, if the power is
0.80, we will correctly identify a difference between the groups with probability 0.80.
Power greater than or equal to 0.80 is often recognized by the research community to be
sufficient, though some researchers seek 0.90 as a minimum.
In a cluster randomized trial, the power of a test is a function of the cluster size, n,
the number of clusters, J, the intra-class correlation, ρ, and the effect size, δ, holding
α constant. As we shall see, given ρ and δ, the power in cluster randomized trials is
dominated by the number of clusters, not the number of subjects within a cluster.
Therefore to increase the power, we generally want to increase the number of clusters.
However, increasing the number of clusters may be far more expensive than adding
additional subjects within a cluster, which can be problematic since all studies have a
fixed budget.
Consider the math example. Once a new math program is implemented within a
school, it is relatively inexpensive to test more students and include them in the sample.
Adding more clusters, or schools, is much more expensive. Adding a new school requires
securing an agreement with school leaders to participate, training additional teachers in
the new program, buying the necessary supplies for a school, and paying for data
collectors to travel to the school. This can be very costly, and it may not be feasible to
include a large number of schools.
In addition to sample size, the desired effect size and intra-class correlation
coefficient also contribute to the power of the test. Larger effect sizes produce higher
power. Smaller values of the intra-class correlation coefficient, which measures the
fraction of variation lying between schools, also increase power. However, the researcher
does not have as much control over these quantities as they are strongly determined by
the phenomenon under investigation. Let’s take a closer look at the model to see how n,
J, ρ , and δ influence power.
1.1.2 The Model
We can write the model for a cluster randomized trial in hierarchical form, with
individuals nested within clusters. The level-1, or person-level model is:
Yij = β0j + eij,    eij ~ N(0, σ²) (1)

for persons i ∈ {1, 2, ..., n} within clusters j ∈ {1, 2, ..., J},

where Yij is the outcome for person i in cluster j;
β0j is the mean for cluster j;
eij is the error associated with each person; and
σ² is the within-cluster variance.
The level-2 model, or cluster-level model is:
β0j = γ00 + γ01·Wj + u0j,    u0j ~ N(0, τ) (2)

where γ00 is the grand mean;
γ01 is the mean difference between the treatment and control group, or the main
effect of treatment;
Wj is the treatment contrast indicator, ½ for treatment and −½ for control;
u0j is the random effect associated with each cluster; and
τ is the variance between clusters.

Substituting (2) into (1) yields the mixed model:

Yij = γ00 + γ01·Wj + u0j + eij,    u0j ~ N(0, τ) and eij ~ N(0, σ²). (3)
We are interested in the main effect of treatment, γ01, estimated by:

γ̂01 = ȲE − ȲC (4)

where ȲE is the mean for the experimental group and ȲC is the mean for the control
group. When each treatment has an equal number, J/2, of clusters, the variance of the
main effect of treatment is:

Var(γ̂01) = 4(τ + σ²/n) / J (5)

where n is the total number of participants per cluster and J is the total number of
clusters.
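The mixed model in Equation 3 is easy to simulate, which can help build intuition about the design. The sketch below (an illustration using NumPy, not part of the Optimal Design software; the function name `simulate_crt` is our own) draws one balanced trial and estimates the main effect as the difference in group means, as in Equation 4:

```python
import numpy as np

def simulate_crt(n, J, gamma00, gamma01, tau, sigma2, seed=0):
    """Simulate one balanced cluster randomized trial from the
    mixed model Yij = g00 + g01*Wj + u0j + eij (Equation 3)."""
    rng = np.random.default_rng(seed)
    W = np.repeat([0.5, -0.5], J // 2)           # treatment contrast per cluster
    u = rng.normal(0, np.sqrt(tau), J)           # cluster random effects, N(0, tau)
    e = rng.normal(0, np.sqrt(sigma2), (J, n))   # person-level errors, N(0, sigma^2)
    Y = gamma00 + gamma01 * W[:, None] + u[:, None] + e
    return W, Y

# One trial with n = 50, J = 56, a true effect of 0.2, and rho = 0.05
W, Y = simulate_crt(n=50, J=56, gamma00=0.0, gamma01=0.2, tau=0.05, sigma2=0.95)
# Estimate the main effect as the difference in group means (Equation 4)
est = Y[W == 0.5].mean() - Y[W == -0.5].mean()
```

Repeating such simulations and counting how often the estimated effect is declared significant gives a Monte Carlo check on the analytic power formulas developed below.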
1.1.3 Testing the Main Effect of Treatment
We can use hypothesis testing to determine if the main effect of treatment is
“statistically significant,” that is, not readily attributable to chance. Recall that a two-
tailed null hypothesis states there is no difference whereas the alternative hypothesis
states there is a difference. In symbols:

H0: γ01 = 0
H1: γ01 ≠ 0
If the data are balanced, that is, there is an equal number of participants in each cluster,
we can use the results of a two factor nested ANOVA to test the main effect of
treatment.1 The test statistic is an F statistic, which compares treatment variance to
cluster variance. The F statistic is defined as:
1 This is the same result we would obtain using a two-level hierarchical linear model (Equations 1 and 2) estimated by means of restricted maximum likelihood.
F = MS_treatment / MS_cluster (6)
Note the F statistic converges to the ratio of expected mean squares, which is defined as:

E(MS_treatment) / E(MS_cluster) = (nτ + σ² + nJγ01²/4) / (nτ + σ²) (7)

and can be rewritten as:

E(MS_treatment) / E(MS_cluster) = 1 + λ,  where  λ = (nJγ01²/4) / (nτ + σ²). (8)
If the null hypothesis is true, the F statistic follows a central F distribution with 1 degree
of freedom for the numerator and J-2 degrees of freedom for the denominator. Under the
central F distribution, we would expect the F statistic to be approximately 1. In other
words, there is no variation between treatments, so γ01 ≈ 0 and the term nJγ01²/4 in the
numerator of the expected mean square ratio goes towards 0. We see that if λ = 0, the
ratio of expected mean squares thus reduces to

E(MS_treatment) / E(MS_cluster) = (nτ + σ²) / (nτ + σ²) = 1.
If the null hypothesis is false so that there is a treatment difference, that is γ01 ≠ 0,
the F statistic follows a non-central F distribution with 1 degree of freedom for the
numerator and J-2 degrees of freedom for the denominator, characterized by the
non-centrality parameter λ (see Equation 8). λ can be rewritten as:

λ = γ01² / [4(τ + σ²/n)/J] (9)

Note that λ, known as the non-centrality parameter, is the ratio of the squared main
effect to the variance of the estimate of the treatment effect. Equation 9 clearly shows that
the non-centrality parameter, λ, is a function of γ01, n, J, τ, and σ².
The non-centrality parameter is strongly related to the power of the test. As
λ increases, the power increases. Let’s see what makes λ increase. Increasing the
treatment effect increases λ . Thus, if we are trying to detect a larger difference in means,
λ increases and so the power also increases. Note that the denominator is identical to the
variance of the treatment effect (Equation 5). So to increase λ we could decrease the
variance of the main effect of treatment. Because the standard error of the treatment
effect is more commonly discussed, instead of referring to the variance of the main effect
of treatment, we will refer to the standard error of the main effect of treatment, which is
simply:
SE(γ̂01) = √[4(τ + σ²/n)/J] (10)

Notice that increasing n and J will decrease the standard error, thus increasing the power.
Also, decreasing τ and σ² will decrease the standard error and increase the power. The
remainder of this chapter explores how n, J, τ, and σ² affect the power of the test.
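Equation 10 can be evaluated numerically to see how n and J trade off. The sketch below is an illustration only (the function name and the plugged-in values are ours, chosen to match the worked examples later in this chapter, with total variance constrained to 1 so that τ = 0.05 and σ² = 0.95):

```python
import math

def se_treatment_effect(tau, sigma2, n, J):
    """Standard error of the main effect of treatment (Equation 10)."""
    return math.sqrt(4 * (tau + sigma2 / n) / J)

# Baseline: 50 students per cluster, 56 clusters, rho = 0.05
base = se_treatment_effect(tau=0.05, sigma2=0.95, n=50, J=56)
# Doubling the number of clusters vs. doubling the cluster size
more_clusters = se_treatment_effect(tau=0.05, sigma2=0.95, n=50, J=112)
bigger_clusters = se_treatment_effect(tau=0.05, sigma2=0.95, n=100, J=56)
```

Comparing `more_clusters` with `bigger_clusters` shows that doubling J shrinks the standard error more than doubling n, which is the central design lesson of the sections that follow.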
1.1.4 Cluster Size, n
The cluster size, n, refers to the number of participants in each cluster. In the
school example, n is the number of students in the new math series group (experimental
group) or the number of students in the regular math series group (control group). In
general, increasing n decreases the standard error of the treatment effect (equation 10)
thus increasing the power. However, at some point, increasing n without increasing the
number of clusters, J, provides no further benefit. Thus as n → ∞, Equation 10 reduces to
SE(γ̂01) = 2√(τ/J), which remains greater than zero unless τ = 0.
1.1.5 Number of Clusters, J
As the total number of clusters, J, increases, the power to detect significant
differences also increases. As mentioned earlier, the number of clusters has a stronger
influence on power than the cluster size. As J increases towards infinity, the power
approaches 1 regardless of n. This is because as J increases towards infinity, the standard
error (10) gets infinitely small. This causes the non-centrality parameter to increase
towards infinity, which results in the power approaching 1. Intuitively this makes us think
that we should just continue to increase J until the desired power is achieved. However,
increasing J or adding additional clusters may not be feasible due to budgetary
constraints. Choosing the optimal sample size with a fixed budget is discussed more
thoroughly in section 1.2.
1.1.6 Intra-Class Correlation, ρ
In addition to the number of clusters, the variability between clusters also affects
power. The variability is defined in terms of the intra-class correlation coefficient, ρ . The
intra-class correlation, ρ, is a ratio of the variability between clusters to the total
variability:

ρ = τ / (τ + σ²) (11)

where τ is the variation between clusters;
σ² is the variation within clusters; and
τ + σ² is the total variation.
For US data sets on school achievement, ρ typically ranges between 0.05 and 0.15. In
neighborhood research on mental health, ρ will generally be smaller. Because τ + σ² is
the total variation, we can constrain it to be 1. Algebraic manipulation of the formula then
reveals τ = ρ and σ² = 1 − ρ. As ρ increases, we know more of the variation is due to
between-cluster variability. Replacing τ and σ² with ρ and 1 − ρ in the standard error
formula (Equation 10), the standard error of the main effect of treatment can be rewritten
as:

SE(γ̂01) = √[4(ρ + (1 − ρ)/n)/J] (12)
From equation 12, we can see that increased values of ρ increase the standard error thus
decreasing the power. Also, as ρ increases, the effect of n decreases. Therefore, if there is
a lot of variability between clusters, we gain more power by increasing the number of
clusters sampled. The key idea for ρ is that power increases as ρ decreases for a fixed n
and J.
1.1.7 Standardized Effect Size, δ
The treatment effect is the difference between the means of the two groups.
However, because data for different experiments are collected on different scales,
standardizing the data is important so the results are meaningful to any researcher, not
just someone who is familiar with a particular data set. A standardized effect size, δ, is
the difference in population means between the two groups divided by the standard
deviation of the outcome:

δ = γ01 / √(τ + σ²) (13)
where γ01 = μE − μC;
μE is the population mean for the experimental group; and
μC is the population mean for the control group.

Given σ² and τ, the standardized effect size, δ, is estimated by:

δ̂ = (ȳE − ȳC) / √(σ² + τ) (14)
and

SE(δ̂) = √[4(ρ + (1 − ρ)/n)/J] (15)
Standardized effect sizes between 0.50 and 0.80 are considered large, and effect sizes as
small as 0.20 to 0.30 are often considered worth detecting. Note that larger effect sizes
are easier to detect. The interpretation of a given δ as “large” or “small,” is, however,
sensitive to the research setting and the capacity of researchers to implement powerful
treatment and to measure outcomes with high validity.
Prior to calculating the actual effect size from the sample, the researcher must
specify a desired minimum effect size to calculate the power of the test. Recall that the
power of the test is driven by the non-centrality parameter, λ (equation 9). We can
redefine λ in terms of the standardized effect size as shown below:
λ = (nJδ²/4) / (nρ + 1 − ρ) = δ² / [4(ρ + (1 − ρ)/n)/J] (16)
Now we can calculate the power of the test knowing only n, J, δ, and ρ. The Optimal
Design software uses the standardized model notation.
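With Equation 16, power follows from the non-central F distribution with 1 and J − 2 degrees of freedom. The sketch below shows one way to compute it, assuming SciPy is available (this is our own illustration, with a hypothetical function name, not the Optimal Design software's code):

```python
from scipy.stats import f, ncf

def crt_power(n, J, rho, delta, alpha=0.05):
    """Power for the main effect of treatment in a balanced cluster
    randomized trial, via the non-centrality parameter of Equation 16."""
    lam = delta**2 / (4 * (rho + (1 - rho) / n) / J)  # non-centrality parameter
    f_crit = f.ppf(1 - alpha, 1, J - 2)               # central F critical value
    return 1 - ncf.cdf(f_crit, 1, J - 2, lam)         # P(F > crit | non-central F)
```

For instance, `crt_power(n=50, J=56, rho=0.05, delta=0.20)` comes out close to 0.80, in line with the worked example in the next section.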
1.1.8 Example
Let’s take a look at an example to see how the various components work together
to affect the power of a test. Suppose a new literacy program has been developed. The
founders of the new program propose that students who participate in the program will
have increased reading achievement. They decide to conduct a cluster randomized trial.
Based on past studies, they estimate ρ as 0.05, meaning that 5% of the total variation in
the outcome lies between clusters. The researchers want to be able to detect a minimum
effect size of 0.20, or 20% of a standard deviation. Note this is a small effect size.
Assume they have 50 students in each cluster. How many clusters are necessary to
achieve power = 0.80?
Let’s take a look at Figure 1, produced with the OD software. The graph shows
the power on the y-axis varying as a function of the number of clusters on the x-axis,
while holding constant ρ =0.05, δ =0.20, and n=50.
Figure 1: CRT - Power vs. Number of Clusters
[Plot of power (y-axis) against number of clusters (x-axis) for α = 0.050, n = 50, δ = 0.20, ρ = 0.05.]
We can see that as J increases, the power increases towards 1.0. Clicking on the graph at
where power = 0.80 reveals J = 56. This means a total of 56 clusters, 28 per treatment,
are necessary to achieve power = 0.80 when ρ =0.05, δ =0.20, and n=50.
Let’s see how the graph would change if the expected effect size is increased to
0.40 while holding ρ =0.05 and n=50. Since this is a larger effect size, we would expect
to be able to achieve power = 0.80 without needing as many clusters. Figure 2 displays
both graphs.
Figure 2: CRT - Power vs. Number of Clusters
[Plot of power against number of clusters for α = 0.050, n = 50, with curves for δ = 0.20, ρ = 0.05 and δ = 0.40, ρ = 0.05.]
Looking at the two graphs we can see that if the effect size is 0.40, fewer clusters are
needed to achieve power of 0.80. Clicking on the trajectory reveals that 16 clusters, 8 per
treatment, are necessary to achieve power = 0.80. Recall 56 clusters were necessary for
an effect size of 0.20 so this is a big reduction.
Let’s see how the graph would change for different values of ρ . Assume that two
values of ρ based on past studies are 0.05 and 0.10. We expect the power to decrease as
ρ increases for a fixed sample size. Let’s take a look at Figure 3.
Figure 3: CRT - Power vs. Number of Clusters
[Plot of power against number of clusters for α = 0.050, n = 50, with curves for δ = 0.20 and δ = 0.40, each crossed with ρ = 0.05 and ρ = 0.10.]
For both effect sizes, the larger value of ρ increases the number of clusters necessary to
achieve power = 0.80. Though the increase in ρ may seem small, to achieve
power = 0.80 with δ = 0.20 and ρ =0.10, the number of clusters necessary jumps from
56 to 96.
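This jump can be verified numerically. As a sketch (again assuming SciPy, with an illustrative function name of our own, simply recomputing Equation 16 and the non-central F tail probability):

```python
from scipy.stats import f, ncf

def power_at(J, n=50, rho=0.10, delta=0.20, alpha=0.05):
    """Power via the non-central F distribution (Equations 9 and 16)."""
    lam = delta**2 / (4 * (rho + (1 - rho) / n) / J)
    return 1 - ncf.cdf(f.ppf(1 - alpha, 1, J - 2), 1, J - 2, lam)
```

With ρ = 0.10, `power_at(96)` is close to 0.80, while `power_at(56)`, the number of clusters that sufficed when ρ = 0.05, falls well short.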
Let’s see how things change if we allow the cluster size to vary and fix the
number of clusters. The graph in Figure 4 allows n to vary along the x-axis and shows the
corresponding power along the y-axis for fixed ρ, δ, and J.
Figure 4: CRT - Power vs. Number of Subjects per Cluster
[Plot of power against number of subjects per cluster for α = 0.050, J = 20, with curves for δ = 0.20 and δ = 0.40, each crossed with ρ = 0.05 and ρ = 0.10.]
As we can see in the graphs, increasing n does not make the power increase
towards 1. When the number of clusters was allowed to increase, the power increased
much more rapidly towards 1, which shows that the number of clusters is more influential
than the cluster size in increasing power. Thinking back to our standard error formula,
this is exactly as we would expect. Recall the standard error formula below:

SE(Main Effect of Treatment) = √[4(ρ + (1 − ρ)/n)/J]

As J increases toward infinity, the standard error gets infinitely small, thus the power will
increase towards 1. However, as n gets larger, the standard error does not get infinitely
small. Instead, the standard error of the main effect of treatment reduces to:

SE(Main Effect of Treatment) = √(4ρ/J) as n approaches ∞

Thus it is clear that increasing J has a greater effect on the power than increasing n,
unless ρ = 0.
This example makes it clear that there are many things to consider when planning
a cluster randomized trial. The OD software is a tool for helping the researcher design a
study with the appropriate number of subjects, clusters, and adequate power. Details
regarding how to produce the figures in the example are discussed in Chapter 3.
1.2 Optimal Sample Allocation
As described in 1.1, power depends on within cluster sample size, n, the number
of clusters, J, the intra-class correlation, ρ, and the desired effect size, δ. ρ and δ are
typically estimated by the researcher based on prior knowledge and similar studies. This
leaves the sample size components, n and J, for the researcher to specify.
It is a common belief that increasing n will increase the power. However, as we
saw earlier increasing n only increases power to a certain point. Power is more strongly
affected by increasing J rather than increasing n in cluster randomized designs. This may
suggest that the best thing to do is just to make J very large. However, as previously
discussed, adding more clusters is often expensive, and usually costs more than adding
people within a cluster. Because many studies are on a limited budget, it is important to
find the optimal allocation of n and J for a fixed budget.
The total variable cost of data collection can often be reasonably approximated by
the formula below:
T = J(nC1 + C2) (17)
where J = number of clusters;
n = number of participants within a cluster;
C1 = cost per participant;
C2 = cost per cluster; and
T = total cost.
To calculate the optimal sample size, first find the optimal n and then find the
optimal J. The optimal n in this case is the one that minimizes the variance of the
treatment effect. Recall that the variance of the main effect of treatment, defined in
Equation 5, is:

Var(γ̂01) = 4(τ + σ²/n) / J

Substituting J = T/(nC1 + C2) (a simple rearrangement of the cost equation) and
minimizing the variance with respect to n, we obtain the formula for optimal n:

n_opt = (σ/√τ) √(C2/C1) (18)

where σ is the within-cluster standard deviation;
√τ is the between-cluster standard deviation;
C1 is the cost per person; and
C2 is the cost per cluster.
From the formula, we can see as the within-cluster variance increases relative to
the between-cluster variance, optimal n increases. Intuitively this makes sense. If there is
large variation within clusters, we would want to sample more people in each cluster to
represent that variation. However, if the within cluster variation is very small, optimal n
decreases. In this case, we want fewer people in each cluster because most of the
variation is between clusters so adding more people will not be very helpful. In terms of
the cost ratio, if the cost per cluster becomes increasingly larger than cost per person we
are penalized for adding clusters and the optimal n increases. After the optimal n is
found, the number of clusters can be calculated by plugging n back into the formula for J:

J = T / (nC1 + C2) (19)
The cost per cluster and cost per person may be the same in the control and experimental
groups, or they may differ. The remainder of this chapter looks at optimal sample allocation
when costs of sampling the two groups are equal and when they are not equal.
1.2.1 Equal Costs
The simplest case is when the sampling costs are the same for the treatment and
control groups. The following example illustrates how to calculate the optimal n and the
resulting J to minimize the variance for a fixed budget.
A researcher wants to determine the effect of a new drug prevention program in
schools. The total budget for sampling costs is $10000. The cost per cluster (C2) is $400
and cost per person (C1) is $20. The estimated intra-class correlation coefficient is 0.05.
What is the optimal n? How many clusters will be in the study? Using Equations 18 and
19 described above, the optimal n and J can be computed by hand as shown below.
Step 1: Set τ + σ² = 1, so τ = ρ and σ² = 1 − ρ. For this example, τ = .05 and σ² = .95.
Step 2: Calculate √τ = .2236 and σ = .9747.
Step 3: Find the cost ratio C2/C1 = 400/20 = 20.
Step 4: Set up the equation n_opt = (.9747/.2236) √20 ≈ 20.

Plugging n = 20 into J = T/(nC1 + C2) yields J = 12.5, which is rounded down to 12 in
order to stay within budget. The value of the variance of the treatment effect can also be
calculated by plugging n and J into the variance equation.
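The hand calculation above can also be scripted. Below is a minimal sketch of Equations 18 and 19 (our own illustration, assuming the total variance is constrained to 1 as in Step 1; the rounding choices mirror the example):

```python
import math

def optimal_allocation(rho, c1, c2, budget):
    """Optimal cluster size (Equation 18) and number of clusters
    (Equation 19) for a fixed budget, with tau = rho and sigma^2 = 1 - rho."""
    sigma = math.sqrt(1 - rho)                      # within-cluster SD
    between_sd = math.sqrt(rho)                     # between-cluster SD, sqrt(tau)
    n = (sigma / between_sd) * math.sqrt(c2 / c1)   # Equation 18
    n = max(1, round(n))
    J = budget // (n * c1 + c2)                     # Equation 19, rounded down
    return n, J
```

For the drug-prevention example (ρ = 0.05, C1 = $20, C2 = $400, T = $10,000), this returns an optimal n of roughly 20 students per cluster and J = 12 clusters, matching the hand calculation.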
The Optimal Design software can be used to do these calculations. The software
produces a plot as shown below:
Figure 5: CRT - Optimal n vs. rho (x-axis: intra-class correlation, 0.06 to 0.25; y-axis: optimal n; α = 0.050, C2/C1 = 20.000)
The plot allows the researcher to see how the optimal n changes with respect to
the intra-class correlation coefficient. Notice that as ρ increases, optimal n decreases. In
other words, if there is large between-cluster variance then it is not very helpful to
increase the number of people per cluster and more money should be spent trying to
increase the number of clusters.
Notice that in the previous example there were no power calculations or set effect
sizes. If the desired effect size is specified, then the Optimal Design software can be used
to calculate the optimal n and J that maximize power. For example, recall from the example
above that T=$10,000, C2=$400, C1=$20, and ρ = 0.05. Imagine that the desired effect
size is 0.40. Plugging these values into the OD software, which solves for the n and J that
maximize power, reveals an optimal n = 18, J = 13, and power = 0.53. Knowing that
the power is only 0.53 and acceptable power levels are typically 0.80 or higher, the
researcher may need to try to increase the budget in order to achieve higher power.
1.2.2 Unequal Costs
If the cost of sampling persons or clusters varies across the treatment groups, the
optimal design will not be balanced, even assuming the variances are the same in each
treatment group. The optimal cluster size and/or the optimal number of clusters will be
different as a function of these cost differences. However, the current version of the
Optimal Design software does not provide optimal allocation formulas in this setting.
2. Including a Cluster Level Covariate in a Cluster Randomized Trial
A common problem facing researchers designing cluster randomized trials is that
the cost of a study frequently limits the number of clusters, resulting in a lack of
statistical power. One method to combat this problem is to include a covariate in the
design and analysis of a cluster randomized trial. Including a covariate may reduce the
number of clusters necessary to achieve a specified level of power. This chapter provides
a brief conceptual background for including a cluster level covariate in a cluster
randomized trial.
2.1 Why include a cluster level covariate?
To illustrate, let’s consider another new math program. Suppose the goal of the
study is to determine if a new 2nd grade math series is superior to the standard math
series. Let's assume that ρ = 0.10 and the researcher seeks to detect a minimum effect
size of 0.20. Assume there are 50 2nd grade students from each school and the researcher
has secured 50 schools, 25 in the treatment group and 25 in the control group. Due to
budgetary constraints, 50 schools and 50 2nd grade subjects within each school is the
maximum number available to the researcher. Entering δ, ρ, n, and J into the cluster
randomized trial option in the Optimal Design software reveals that the researcher only
has power = 0.52. The low power makes it difficult for the researcher to detect the
expected effect. Thus, an important effect may well go undetected. Including a covariate
in the design and analysis may greatly increase the power.
In this chapter, we focus specifically on including a covariate at the cluster level.
This may be an aggregated covariate, such as pre-test scores aggregated across schools or
school SES. Recall that power in a cluster randomized trial is a function of the minimum
detectable effect size, δ, the intra-class correlation, ρ, the number of clusters, J, and the
cluster size, n, while holding α constant. When we include a covariate in the design, there
is an additional component that influences the power of the test: the strength of the
correlation between the covariate and the true cluster mean outcome. The strength of this
correlation is denoted ρ_xβ0. We adopt this notation because β_0j is the true mean
outcome for cluster j, and X_j is the covariate. The residual level-2 variance, or
unexplained variance after accounting for the covariate, is denoted τ|x. As we will see
later, the stronger the correlation, ρ_xβ0, the smaller the conditional level-2 variance,
τ|x, compared to the unconditional level-2 variance, τ, and the greater the benefit of the
covariate in increasing precision and power. Let's take a closer look at the model with a
cluster-level covariate.
2.2 The Model
In hierarchical form, the level-1 model for a cluster randomized trial with a
cluster-level covariate is the same as the level-1 model in Chapter 1:

Y_ij = β_0j + e_ij,  e_ij ~ N(0, σ²)  (1)

for persons i ∈ {1, 2, ..., n} per cluster and clusters j ∈ {1, 2, ..., J},
where β_0j is the mean for cluster j;
e_ij is the error associated with person i in cluster j; and
σ² is the within-cluster variance.
The level-2 model, or cluster-level model, differs from a simple cluster randomized trial
because it includes a term for the cluster-level covariate. The model is:

β_0j = γ_00 + γ_01·W_j + γ_02·X_j + u_0j,  u_0j ~ N(0, τ|x)  (2)

where γ_00 is the grand mean;
γ_01 is the mean difference between the treatment and control group, or the main
effect of treatment;
γ_02 is the regression coefficient for the cluster-level covariate;
W_j is the treatment contrast indicator, ½ for treatment and -½ for control;
X_j is the cluster-level covariate, centered around its grand mean;
u_0j is the random effect associated with each cluster; and
τ|x is the residual variance between clusters.
Note that the between-cluster variance, τ|x, is now the residual variance
conditional on the cluster-level covariate X. For the purposes of this paper, we assume
there is no interaction between the cluster-level covariate, X, and the treatment group, W.
This assumption can be relaxed and in general should be checked, since a researcher may
be interested in how the treatment effect varies at different levels of the covariate.
Similar to the cluster randomized trial without a covariate, we are interested in the
main effect of treatment, or the difference between the treatment average and control
average adjusting for the covariate. However, now it is estimated by:

γ̂_01 = (Ȳ_E − Ȳ_C) − γ̂_02(X̄_E − X̄_C)  (3)

where Ȳ_E is the mean for the experimental group;
Ȳ_C is the mean for the control group;
X̄_E is the covariate mean for the experimental group; and
X̄_C is the covariate mean for the control group.
Note that the estimated main effect of treatment looks like the estimated effect
without the covariate except that here we are adjusting for treatment group differences in
the covariate. The variance of the main effect of treatment is estimated by (Raudenbush,
1997):

Var(γ̂_01) = [4(τ|x + σ²/n)/J][1 + 1/(J − 4)]  (4)

where n is the number of subjects per cluster;
J is the total number of clusters; and
τ|x is the conditional level-2 variance, (1 − ρ²_xβ0)τ.
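To see the precision gain numerically, here is a small sketch (our own helper function, not part of the OD software) that evaluates equation 4 in standardized form for the math-series example of section 2.1, assuming the small-sample factor reads 1 + 1/(J − 4):

```python
def var_adjusted(icc, r2, n, j):
    """Variance of the adjusted treatment effect, following equation 4:
    [4 * (tau_conditional + sigma^2/n) / J] * [1 + 1/(J - 4)],
    with tau + sigma^2 standardized to 1, so tau = rho and sigma^2 = 1 - rho."""
    tau_x = (1.0 - r2) * icc       # conditional level-2 variance, equation 8
    sigma2 = 1.0 - icc
    return (4.0 * (tau_x + sigma2 / n) / j) * (1.0 + 1.0 / (j - 4))

# Math-series example from section 2.1: rho = 0.10, n = 50, J = 50, r = 0.75
v_with = var_adjusted(0.10, 0.75 ** 2, 50, 50)
v_without = 4.0 * (0.10 + 0.90 / 50) / 50   # unconditional variance, no covariate
```

Even after paying the small-sample penalty of 1 + 1/(J − 4), the conditional variance is roughly half the unconditional variance here, which is where the power gain in the next section comes from.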
2.3 Testing the Main Effect of Treatment
Similar to the case without a covariate, we can use hypothesis testing to determine
if the main effect of treatment is "statistically significant," that is, not readily attributable
to chance. If the data are balanced, we can use the results of a nested analysis of
covariance with random effects for clusters and fixed effects for the treatment and
covariate. The test statistic is an F statistic, which compares adjusted treatment variance
to the adjusted cluster variance. The F statistic is defined as:

F = MS_treatment/MS_clusters  (5)

where MS_treatment and MS_clusters are now adjusted for the covariate.
Note that the F statistic converges to the ratio of expected mean squares, defined as:

E(MS_treatment)/E(MS_clusters) = 1 + λ_x  (6)

The F test follows a non-central F distribution, F(1, J − 3; λ_x), in the case of a cluster-level
covariate, where the non-centrality parameter, λ_x, is:

λ_x = γ_01² / [4(τ|x + σ²/n)/J]  (7)

and

τ|x = (1 − ρ²_xβ0)τ.  (8)
From equations 7 and 8, we can see that the stronger the correlation, ρ_xβ0, the smaller
τ|x, and the greater the increase in the power of the test.
Note that the non-centrality parameters with and without the covariate are closely
related. If the correlation between the covariate and the cluster-level mean is 0, τ|x reduces
to τ and the non-centrality parameter reduces to λ, the non-centrality parameter in the
case of no covariate. Although we are reducing the between-cluster variance, one
consequence of including a covariate is that we lose one degree of freedom. In the case of
no covariate, the F test follows a non-central F distribution, F(1, J − 2; λ), whereas in the
covariate case we have F(1, J − 3; λ_x). This may be a potential problem in a study with a
small number of clusters.
The non-centrality parameter can be defined in standardized notation. Recall that
in equation 7 we define the non-centrality parameter as λ_x = γ_01² / [4(τ|x + σ²/n)/J].
Replacing τ|x = (1 − ρ²_xβ0)τ, constraining τ + σ² = 1, and defining δ = γ_01/√(τ + σ²),
we can rewrite λ_x as a function of δ, ρ, and ρ_xβ0, as shown below:

λ_x = δ² / {4[(1 − ρ²_xβ0)ρ + (1 − ρ)/n]/J}  (9)

Note that the only difference in the non-centrality parameter in the case of the cluster-level
covariate is the correction factor, (1 − ρ²_xβ0). The correction factor only affects ρ,
the between-cluster variation, since the covariate is a cluster-level covariate. As the
correlation between the covariate and the cluster-level means increases, the conditional
intra-class correlation decreases. This results in an increase in the value of the non-centrality
parameter and therefore an increase in the power of the test.
2.4 Example
Recall that in the example in section 2.1, the researchers wanted to test the
effectiveness of a new 2nd grade math series. They estimated ρ = 0.10, desired a
minimum effect size of 0.20, and had 50 clusters and 50 subjects per cluster. The power to
detect an effect was only 0.52. Suppose the researchers plan to give the students a pre-test
prior to implementing the new program. Pre-test scores will be aggregated to the school
level. Based on past research, they estimate that the pre-test has a correlation of 0.75 with
the true cluster mean post-test. In other words, the cluster-level pre-test scores explain
0.75² = 0.5625, or about 56 percent, of the variation in true cluster-level post-test scores.
What is the new power that the researchers achieve when they include the covariate in the
design?
To determine the power, we can use the cluster randomized trial option from the
OD software. Figure 1 below shows the trajectory with and without the covariate.
Figure 1. Power vs. number of clusters (x-axis: number of clusters; y-axis: power; α = 0.050, n = 50; solid: δ = 0.20, ρ = 0.10; dotted: δ = 0.20, ρ = 0.10, R²_L2 = 0.56).
The trajectory indicated by the dotted line uses the information contained in the covariate.
Clicking along the trajectory, we find that with 50 clusters the researchers can achieve
power = 0.80. In other words, the power to detect an effect increased from 0.52 to 0.80 by
including the covariate.
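The gain from 0.52 to 0.80 can be approximated by hand. Below is a minimal sketch that combines the non-centrality parameter of equation 9 with a normal approximation in place of OD's exact non-central F computation, so the values land near, not exactly on, the OD figures; the function names are ours, not part of the OD software:

```python
import math
from statistics import NormalDist

def noncentrality(delta, icc, r2, n, j):
    """Equation 9: lambda_x = delta^2 / [4*((1 - r2)*rho + (1 - rho)/n) / J]."""
    return delta ** 2 / (4.0 * ((1.0 - r2) * icc + (1.0 - icc) / n) / j)

def approx_power(lam, alpha=0.05):
    """Normal approximation to the power of the two-sided test:
    power ~ Phi(sqrt(lambda) - z_{1-alpha/2}); it ignores a negligible lower
    tail and the F test's finite denominator df, so it runs slightly high."""
    nd = NormalDist()
    return nd.cdf(math.sqrt(lam) - nd.inv_cdf(1.0 - alpha / 2.0))

# Math-series example: delta = 0.20, rho = 0.10, n = 50, J = 50
p_without = approx_power(noncentrality(0.20, 0.10, 0.0, 50, 50))      # ~0.54 vs OD's 0.52
p_with = approx_power(noncentrality(0.20, 0.10, 0.75 ** 2, 50, 50))   # ~0.81 vs OD's 0.80
```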
Although the use of a covariate can substantially increase power in a cluster
randomized trial, there are caveats. First, the choice of covariate must be specified during
the design phase, prior to any data analysis. A procedure that checks model estimates for
a list of possible covariates using the study data will produce biased tests of the
significance of the treatment effect. If point estimates of the treatment effect look very
different with and without adjustment for the covariate, a skeptic may suspect
opportunistic choice of the covariate. Results that are highly sensitive to the choice of
covariate inevitably arouse uncertainty. Finally, as mentioned, the researcher should
check the assumption that the covariate's association with the outcome is the same across
treatment groups.
3. Using the Optimal Design Software for Cluster Randomized Trials
This chapter focuses on how to use the Optimal Design (OD) software to design a
cluster randomized trial with a continuous outcome. Section 3.1 gives some general
information regarding the software. Section 3.2 presents an example that is used to
illustrate the OD software and is explored in detail in subsequent sections. Section 3.3
explains how to use the software to design a study with budgetary constraints.
3.1 General Information
The screen shown in Figure 1 is the OD screen that appears when the software is
opened.
Figure 1. Main menu OD screen.
In this chapter, we focus on the Cluster Randomized Trial (continuous outcomes)
and Optimal Sample Allocation option. Chapter 5 discusses how to use the Optimal
Design program for the case of a Cluster randomized trial with a binary outcome.
Clicking on the Cluster Randomized Trials heading displays the options listed below:
Power for main effect of treatment (continuous outcome)
Power vs. cluster size (n)
Power vs. number of clusters (J)
Power vs. intra-class correlation (rho)
Power vs. effect size (delta)
Power vs. proportion of explained variation by level 2 covariate (R2)2
Power for main effect of treatment (binary outcome)
Power vs. cluster size (n)
Power vs. number of clusters (J)
Power vs. probability of success in treatment group (phi(E))
Optimal sample allocation under budgetary constraints
All of the subheadings under power for the main effect of treatment (continuous
outcome) offer the researcher an opportunity to explore one specific design element in
relation to the power of the study. Each option produces a graph with power on the y-axis
and the specified design element on the x-axis. For example, power vs. cluster size allows
the researcher to see how the power of the test changes (y-axis on the graph) as the
cluster size increases (x-axis on the graph) for fixed J, ρ , and δ . The five subheadings
under power for the main effect of treatment function similarly, so once you are familiar
with one option, the others follow easily. The option for Optimal sample allocation under
budgetary constraints allows a researcher to design a study with a fixed budget. This
option is slightly different from the previous options and is discussed in section 3.3.
Below are a few general things to keep in mind when using the OD software:
1. After clicking on any of the four power options, a new screen with a toolbar
will appear similar to the one in Figure 2:
2 Note that this differs from Version 1.0 of the program. In Version 1.0, the program asked for a covariate correlation. In Version 2.0, the program asks for the proportion of explained variation by the level 2 covariate, R2.
Figure 2. CRT - Power vs. cluster size (n) screen.
However, a graph will not appear until you click on one of the buttons on the toolbar and
click ok.
2. Once you click one option, for example, power vs. cluster size (n), you cannot
click on another option until you click on the X to close the graph.
3.2 Example
Suppose a team of researchers develops a new literacy program for 1st graders. The
founders of the new program propose that students who participate in the program will
have increased reading achievement. They plan to test students who participate in the
new program (experimental group) and students who participate in the regular program
(control group) using a reading test to determine if students using the new program score
higher. The researchers have access to last year's 1st grade average reading test scores for
each school. Past data reveal that last year's scores explain 49% of the variation in test
scores. The researchers want to design a cluster randomized trial with students nested
within classrooms but are not sure how to proceed. Five scenarios the researchers might
encounter are presented below. Assume α = 0.05 for each case.
3.2.1 Scenario 1 – Unknown Cluster Size (n)
Based on past studies, the researchers estimate ρ = 0.05 and want to be able to
detect a minimum effect size of 0.25. Assuming 40 classrooms (clusters) are willing to
participate in the study, how many students per classroom are necessary to achieve
power = 0.80? Find the power of the study with and without the covariate.
In Scenario 1, the cluster size, or number of students per school is unknown. As a
result, we want to select the power vs. cluster size (n) option. This allows the cluster size
to vary along the x-axis. To explore the power vs. cluster size (n) option, click on it.
Figure 3 displays the screen that appears.
Figure 3. CRT - Power vs. cluster size (n) screen.
The toolbar runs across the top of the window. Let’s take a closer look at the function of
each of the buttons on the tool bar.
α - specifies the significance level, or chance of a type I error. By default α is set at 0.05,
which is a common level for most designs.
J – specifies the number of clusters. By default, J is set at 20, but it can be changed based
on the researcher’s needs.
δ - specifies the minimum effect size of interest. By default, the minimum effect size is
set at 0.20 and 0.40. Trajectories for both effect sizes are plotted so they can be
compared. The researcher is allowed to enter up to three different effect sizes.
ρ - specifies the intra-class correlation. By default, ρ is set at 0.05 and 0.10. Again, two
values are specified to allow for comparisons. The researcher is allowed to enter up to
three different values of ρ.
R²_L2 - specifies the proportion of the variation in the level-2 outcome that is explained
by the level-2 covariate.
<x< – sets the range of the x-axis. The x-axis displays the range of the cluster size n. By
default it is set to 2 to 60, but the researcher can change the range.
<y< - sets the range of the y-axis. The y-axis displays the range of the power. Power
ranges from 0 to 1.
Plot graph – plots the graph with all the default settings.
IEG – sets the graph legends. This allows the researcher to give specific labels and titles
to a graph.
Save – saves the graph (See Appendix A for details)
Print – prints the graph
Defs – sets the parameters to default setting.
? – is a help option.
X – closes the window.
Note: Clicking ok after clicking on any of the buttons along the toolbar automatically
displays the graph with the default settings. Once the graph is on the screen, clicking on a
specific parameter allows you to change or add values for that parameter.
Follow the steps below to answer the question.
Step 1: Click on power vs. cluster size (n).
Step 2: Click on J on the toolbar and change J(1) to 40, the total number of clusters in
this study. Clicking ok makes the graphs appear in the window. Below is the new screen:
Figure 4. CRT - Power vs. cluster size with J=40, δ =0.20 and 0.40, ρ =0.05 and 0.10
Note there is a legend that appears in the upper right corner. This defines each of the
trajectories on the screen. This is a quick way to check if δ and ρ are defined correctly. In
our case, since we want δ =0.25, we need to change the settings.
Step 3: Click on δ on the toolbar. Notice delta(1) is set to 0.20 and delta (2) is set to 0.40,
which are the default settings. Change delta(1) to 0.25 and delete delta(2). This allows us
to compare the number of subjects necessary per cluster if we desired a minimum effect
size of 0.25. An additional value of delta can also be added if desired. Click ok. The new
screen is in Figure 5.
Figure 5. CRT - Power vs. cluster size with J=40, δ =0.25, ρ =0.05 and 0.10
Step 4: Looking at the legend, we know the correct ρ for this example is specified.
However, click on ρ on the toolbar to see the options. Notice rho(1) is set to 0.05 and
rho(2) is set to 0.10, the default settings. By leaving rho(2) at 0.10, we are able to see
how changing the value of rho affects the necessary cluster size for a specified power. An
additional value of rho can also be added if desired. Click ok. Since we did not make any
changes, the screen stays the same.
Step 5: Looking at the legend, we know the correct α is specified. Clicking on α on the
toolbar we see α is set to 0.05.
Step 6: Recall that in Scenario 1 we are trying to determine the number of people we
need in each cluster to achieve power of 0.80 with J=40, δ =0.25, and ρ =0.05. Using the
legend in Figure 5 to find the trajectory that matches our specifications, we can click
along the correct trajectory until the power = 0.80 to determine the appropriate n. In this
case, n = 37, so 37 people are required in each cluster. However, 37 people in one
classroom might be unreasonable. Let’s see what happens when we add a covariate.
Step 7: To add the information from the covariate, click on R²_L2. Now you may enter up
to three values for R²_L2. Leave R²_L2(1) equal to 0 but enter R²_L2(2) equal to 0.49.
Click ok. The new screen is in Figure 6.
Figure 6. CRT - Power vs. cluster size with J=40, δ =0.25, ρ =0.05 and 0.10,
and R²_L2 = 0 and 0.49.
By including the covariate, we can achieve power = 0.80 for δ =0.25 and ρ =0.05 with a
classroom (cluster) size of only 19, which is more realistic.
Step 8: Add another value for δ by clicking on δ on the toolbar and specify delta(3) =
0.50. Click ok. The new screen is in Figure 7.
Figure 7. CRT - Power vs. cluster size with J=40, δ =0.25 and 0.50, ρ =0.05 and 0.10,
and R²_L2 = 0 and 0.49.
As you can see, 8 trajectories appear on the screen, one for each combination of δ, ρ,
and R²_L2. The key in the upper right corner defines the various trajectories. Notice the
larger desired effect sizes achieve higher power with fewer people per cluster than do the
smaller effect sizes. Intuitively, this makes sense because it is easier to detect a larger
effect size than a smaller effect size. Also, larger values of ρ decrease the power for a
ρ decrease the power for a
specified effect size. Note that the power does not approach 1 in every case because
increasing n only increases power to a certain point.
Click X on the toolbar to close the screen and select a new option.
3.2.2 Scenario 2 – Unknown Number of Clusters (J)
Based on past studies, the researchers estimate ρ = 0.05 and want to be able to
detect a minimum effect size of 0.25. Assuming that 20 students are willing to participate
in the study from each classroom, how many classrooms (clusters) are necessary to
achieve power = 0.80? Find the power of the study with and without the covariate.
In Scenario 2, the number of clusters, J is unknown. As a result, we want to select
the power vs. number of clusters (J) option. This allows the number of clusters to vary
along the x-axis. Clicking on power vs. number of clusters (J) reveals a blank screen and
a toolbar that looks very similar to the toolbar for power vs. cluster size (n) in Figure 3.
The only difference is now there is an n on the toolbar instead of a J. This is because now
n is set while J is allowed to vary.
Follow the steps below to answer the questions:
Step 1: Click on power vs. number of clusters (J).
Step 2: Click on n on the toolbar. Click ok because the default is 20, which is the number
of students per cluster in this study. Figure 8 displays the screen.
Figure 8. CRT - Power vs. number of clusters with n=20, δ =0.20,0.40, and ρ =0.05,
0.10.
Looking at the key, we know we need to change δ since we are looking for a minimum
effect size of 0.25, which is not the default setting. We can also see that we need to
delete ρ =0.10 since we are interested in ρ =0.05.
Step 3: Click on δ and change delta (1) = 0.25 and delete delta (2).
Step 4: Click on ρ and delete rho (2). Figure 9 displays the new screen.
Figure 9. CRT - Power vs. number of clusters with n=20, δ =0.25, and ρ =0.05.
We can click along the correct trajectory until the power = 0.80 to determine the
necessary J when there is no covariate. Clicking along the trajectory reveals that
approximately 50 clusters are necessary to obtain power =0.80. Notice that as the number
of clusters increases, the power approaches 1 for each trajectory.
Step 5: Click on R²_L2 in order to include the covariate in the power analysis. In order to
compare the designs with and without a covariate, leave R²_L2(1) equal to 0. Recall that
the covariate explained 49% of the variation in the cluster-level outcome, so enter 0.49
for R²_L2(2). Figure 10 displays the new screen.
Figure 10. CRT - Power vs. number of clusters with n=20, δ =0.25, ρ =0.05, and
R²_L2 = 0 and 0.49.
Clicking along the trajectory that includes the covariate, we can see that 40 clusters are
necessary to achieve power = 0.80. Including the covariate reduced the total number of
clusters by 10 which will help reduce the costs of the experiment.
3.2.3 Scenario 3 – Unknown intra-class correlation (rho)
The researchers have 40 classrooms in the study and 50 students per classroom.
They want to be able to detect a minimum effect size of 0.25. What value of the
intra-class correlation coefficient results in power = 0.80? Consider the case with and
without the covariate.
In Scenario 3, the intra-class correlation, ρ , is unknown. As a result, we want to
select the power vs. intra-class correlation (rho) option. This allows the intra-class
correlation to vary along the x-axis. Clicking on power vs. intra-class correlation (rho)
reveals a blank screen similar to Figure 3. However, ρ no longer appears on the toolbar
because it is the unknown quantity.
Follow the steps below to investigate the question.
Step 1: Click on power vs. intra-class correlation (rho).
Step 2: Click on δ on the toolbar and change delta(1) to 0.25. Leave delta(2) at 0.50 for
comparison purposes. Click ok. Figure 11 displays the screen that appears.
Figure 11. CRT - Power vs. ρ with n=50, J=20, and δ =0.25 and 0.50
Note that the legend reveals J=20, n=50, and α =0.05. Since the scenario specifies J=40,
we need to change the setting for J. The settings for n and α are correct.
Step 3: Click on J on the toolbar. Change J(1) to 40 because there are 40 schools in the
example. Click ok. The new screen is in Figure 12.
Figure 12. Power vs. ρ with n=50, J=40, and δ =0.25 and 0.50
Step 4: Recall that in Scenario 3 we are trying to determine the intra-class correlation that
results in power of 0.80 with n=50, J=40, and δ =0.25. Using the legend to find the
trajectory that matches our specifications, click along the appropriate trajectory to
determine the value of the intra-class correlation that results in power = 0.80. The result
is ρ =0.055. Notice that as the intra-class correlation increases, or more of the variation is
due to between-cluster variation, the power of the test decreases, which is consistent with
the results in Chapter 1.
Step 5: Click on R²_L2 in order to include the covariate in the power analysis. Leave
R²_L2(1) equal to 0 but set R²_L2(2) equal to 0.49. Let's remove the extra effect size in
order to keep the screen manageable. Click on δ and delete delta (2). Figure 13 displays
the new screen.
Figure 13. Power vs. ρ with n=50, J=40, δ =0.25, and R²_L2 = 0 and 0.49.
Note that clicking along the dotted trajectory reveals that by including the covariate, a
ρ equal to 0.11 will achieve power = 0.80. In other words, by including the cluster-level
covariate, we can have a larger unconditional ρ and still achieve the desired power.
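Scenario 3's question can also be answered numerically by inverting the power calculation. The sketch below (our own helpers, not part of the OD software) bisects on ρ under the normal approximation to the non-central F test; because that approximation runs slightly high, it lands a bit above OD's exact answers of 0.055 and 0.11:

```python
import math
from statistics import NormalDist

def approx_power(delta, icc, r2, n, j, alpha=0.05):
    """Normal approximation to the test's power, using the equation-9
    non-centrality parameter (ignores the F test's finite df)."""
    lam = delta ** 2 / (4.0 * ((1.0 - r2) * icc + (1.0 - icc) / n) / j)
    nd = NormalDist()
    return nd.cdf(math.sqrt(lam) - nd.inv_cdf(1.0 - alpha / 2.0))

def rho_for_power(target, delta, r2, n, j):
    """Bisection on rho: power falls as rho rises, so keep the half that
    still brackets the target power."""
    lo, hi = 0.001, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if approx_power(delta, mid, r2, n, j) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

rho_no_cov = rho_for_power(0.80, 0.25, 0.0, 50, 40)   # ~0.06 (OD: 0.055)
rho_cov = rho_for_power(0.80, 0.25, 0.49, 50, 40)     # ~0.12 (OD: 0.11)
```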
3.2.4 Scenario 4 – Unknown minimum effect size
The researchers have 40 classrooms in the study and 30 students per classroom.
Based on past studies, the intra-class correlation coefficient is 0.05. What is the minimum
effect size the researchers can detect with power = 0.80? Consider the case with and
without the covariate.
In Scenario 4, the minimum effect size is unknown. As a result, we want to select
the power vs. effect size (delta), which allows the effect size to vary along the x-axis.
Clicking on power vs. effect size (delta) reveals a blank screen with a toolbar similar to
Figure 2. However, δ no longer appears on the screen because it is the unknown quantity.
Follow the steps below to answer the question.
Step 1: Click on power vs. effect size (delta).
Step 2: Click on J on the toolbar. Change J(1) to 40 since there are 40 clusters in the
study. Click ok. Figure 14 displays the screen that appears.
Figure 14. CRT - Power vs. δ with n=50, J=40, and ρ =0.05 and 0.10
Note that the legend shows that n=50, ρ =0.05, and 05.0=α so we do not need to change
any of the settings. However, we can delete ρ =0.10 since it is not required for this
design. Click on ρ and delete rho (2). Figure 15 displays the new screen.
Figure 15. CRT - Power vs. δ with n=50, J=40, and ρ =0.05.
Recall that in Scenario 4 we are trying to determine the minimum effect size that results
in power of 0.80 with n=50, J=40, and ρ =0.05. Clicking along the trajectory until the
power is 0.80 we can see that the minimum effect size the researchers can detect with
power = 0.80 and no covariate is approximately 0.24.
Step 3: Click on R²_L2 in order to include the covariate in the power analysis. Leave
R²_L2(1) equal to 0 but set R²_L2(2) equal to 0.49. Figure 16 displays the new screen.
Figure 16. CRT - Power vs. δ with n=50, J=40, ρ =0.05, and R²_L2 = 0 and 0.49.
Including the covariate in the design allows the researchers to find a minimum detectable
effect size equal to 0.19.
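Both minimum detectable effect sizes can be approximated by hand. A sketch using the standard MDES approximation δ ≈ (z_{1−α/2} + z_power)·SE, with SE² = 4[(1 − R²)ρ + (1 − ρ)/n]/J (cf. equation 9); the normal z-values make it land slightly under OD's exact non-central F answers, and the function name is ours:

```python
import math
from statistics import NormalDist

def approx_mdes(icc, r2, n, j, alpha=0.05, power=0.80):
    """Rough minimum detectable effect size:
    delta ~ (z_{1-alpha/2} + z_power) * sqrt(4*((1-r2)*rho + (1-rho)/n)/J)."""
    nd = NormalDist()
    multiplier = nd.inv_cdf(1.0 - alpha / 2.0) + nd.inv_cdf(power)
    se = math.sqrt(4.0 * ((1.0 - r2) * icc + (1.0 - icc) / n) / j)
    return multiplier * se

mdes_no_cov = approx_mdes(0.05, 0.0, 50, 40)   # close to the 0.24 read from the plot
mdes_cov = approx_mdes(0.05, 0.49, 50, 40)     # close to the 0.19 read from the plot
```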
3.2.5 Scenario 5 – Unknown explanatory power of the cluster-level covariate
The researchers have 40 classrooms in the study and 30 students per classroom.
Based on past studies, the intra-class correlation coefficient is 0.05. They want to detect a
minimum effect size of 0.25. Under these constraints, how much of the cluster-level
variation does the covariate need to explain in order to achieve power = 0.80?
In Scenario 5, the explanatory power of the cluster-level covariate is unknown. As
a result, we want to select the power vs. proportion of explained variation by level 2
covariate (R2). Clicking on power vs. proportion of explained variation by level 2
covariate (R2) reveals a blank screen with a toolbar similar to Figure 2. However, R²_L2 no
longer appears on the screen because it is the unknown quantity.
Follow the steps below to answer the question.
Step 1: Click on default settings.
Step 2: Click on n and set n=30.
Step 3: Click on J and set J=40.
Step 4: Click on δ and set delta (1) = 0.25. Delete delta (2).
Step 5: Click on ρ and delete rho (2). The final screen is in Figure 17.
Figure 17. CRT - Power vs. R²_L2 with n=30, J=40, δ =0.25, and ρ =0.05.
Clicking along the trajectory reveals that if the covariate explains 13% of the variation in
the level-2 outcome, the power equals 0.80.
3.3 Optimal Sample Allocation Under Budgetary Constraints
This section focuses on planning a cluster-randomized study with a fixed budget.
Throughout this section we assume sampling costs for the treatment group are the same
as those for the control group. For example, imagine that for the literacy example
described in Section 3.2, the cost of sampling each school is $500, regardless of whether
the school receives the new program or not. The cost of sampling a student within each
school is $25. Also imagine that the total budget for the study is $20,000. Knowing the
sampling costs allows the researcher to answer 2 questions.
1. What is the optimal n that minimizes the variance of the treatment effect
under these budgetary constraints?
2. What is optimal n and J to maximize the power under these budgetary
constraints?
Both questions can be answered using the OD software and are discussed in Sections 3.3.1
and 3.3.2.
3.3.1. Optimal n vs. ρ to minimize variance
The optimal design software allows a researcher to determine the optimal n that
minimizes the variance of the treatment effect for various values of ρ . Figure 18
displays the screen for the Optimal n vs. rho to minimize variance option.
Figure 18. Optimal n vs ρ screen.
Below is a description of the first three buttons on the toolbar since they differ from
previous screens.
C2/C1 – specifies the cost ratio. C2 is the cost per cluster and C1 is the cost per person.
<x< – sets the range of the x-axis. The x-axis displays the possible values of ρ. By
default ρ ranges between 0.01 and 0.25.
<y< – sets the range of the y-axis. The y-axis displays the values of optimal n.
Let’s use the software to answer question 1. Recall the question asks for the optimal n
that minimizes the variance of the treatment effect when C2 = $500, C1 = $25, and the total
cost is $20,000. Follow the steps below to answer the question.
Step 1: Click on Optimal Sample Allocation under Budgetary Constraints – Equal Costs–
Optimal n vs. rho to minimize variance.
Step 2: Calculate C2/C1 = 500/25 = 20. Click on C2/C1. By default the cost ratios are set
to 5 and 20. By leaving C2/C1 (1) = 5 and C2/C1 (2) = 20, we can make comparisons for
different cost ratios. Click ok. Figure 19 displays the new screen.
Figure 19. Optimal n vs. ρ.
Note the legend identifies the two trajectories based on the cost ratio.
Step 3: Clicking along the trajectory for C2/C1 = 20 we can investigate the optimal n for
different values of ρ . For example, if ρ = 0.05, then optimal n is approximately 20.
Step 4: The software does not calculate the corresponding J based on optimal n and total
cost. However, we can do this by hand using the formulas from Chapter 1.
J = T / (C1·n + C2) = 20,000 / ((20 × 25) + 500) = 20
So to minimize the variance of the treatment effect in this situation, we need 20 people
per school and 20 schools.
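Steps 3 and 4 can be cross-checked with a short script. This is a sketch, not part of the OD software; it assumes the standard equal-cost results from Chapter 1, namely optimal n = √((C2/C1)(1 − ρ)/ρ) and J = T/(C1·n + C2).

```python
import math

def optimal_n(cost_ratio, rho):
    """Optimal cluster size minimizing Var(treatment effect) under equal costs."""
    return math.sqrt(cost_ratio * (1 - rho) / rho)

def clusters_for_budget(total, c1, c2, n):
    """Number of clusters J affordable when each cluster costs C2 + n*C1."""
    return total / (c1 * n + c2)

n_star = optimal_n(cost_ratio=20, rho=0.05)          # about 19.5, round to 20
J = clusters_for_budget(total=20000, c1=25, c2=500, n=20)
print(round(n_star, 1), J)                           # 19.5 20.0
```

Rounding the optimal n to 20 and plugging it back into the budget gives the 20 schools reported above.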
3.3.2. Maximizing Power
The OD software can also be used to calculate the optimal n and J to maximize
the power for a study. The optimal n to maximize power will generally be close to the
optimal n needed to minimize variance. However, in settings with small J, the results will
differ somewhat.
Because this option maximizes the power, a minimum effect size must also be
specified. Recall question 2 asked for the optimal n and J to maximize the power when
the cost of sampling a cluster is $500, the cost of sampling a person is $25, and the total
cost is $20,000. To use the software, ρ , δ , and α must be specified. Assume ρ =0.05,
δ =0.30, and α =0.05. Follow the steps below to calculate the optimal n and J for the
above specifications.
Step 1: Click on Optimal Sample Allocation under Budgetary Constraints – Equal Costs–
Maximizing Power. Figure 20 displays the screen.
Figure 20. Optimal Sample Allocation – Maximize Power
Step 2: Specify the appropriate input by clicking on each box and changing the values to
match the criteria set forth in the example. Figure 21 displays the correct input.
Figure 21. Optimal Sample Allocation – Maximize Power
Step 3. Click Compute. The optimal n, J and power are now displayed. Figure 22 shows
the final screen.
Figure 22. Optimal Sample Allocation – Maximize Power
Notice that the optimal n and J are both 20, which is the same as the results we calculated
in Section 3.3.1 for ρ = 0.05. We expect this to be the same because minimizing the
variance of the treatment effect is the same as maximizing the power. The only difference
in this option is that we are also specifying a minimum effect size, which allows us to
calculate the power. If we increase the minimum effect size to 0.40, the power increases
to 0.77 but the optimal n and J remain the same because the cost ratio and ρ (which
determine the optimal sample sizes) are not influenced by the effect size.
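The power calculation behind this screen can be approximated by hand. The sketch below uses a normal approximation to the test statistic, so for small J it runs slightly above the noncentral-F answer the OD software reports (roughly 0.82 here versus OD’s 0.77 for δ = 0.40).

```python
import math

def crt_power(delta, rho, n, J, alpha=0.05):
    """Two-sided power for the CRT main effect, normal approximation.

    OD uses a noncentral F/t with J - 2 degrees of freedom, so its
    answers run a bit lower than this approximation when J is small.
    """
    lam = delta / math.sqrt(4 * (rho + (1 - rho) / n) / J)
    z = 1.959964  # critical z for alpha = 0.05, two-sided
    # Phi(lam - z); the lower-tail rejection term is negligible here
    return 0.5 * (1 + math.erf((lam - z) / math.sqrt(2)))

print(round(crt_power(delta=0.40, rho=0.05, n=20, J=20), 2))
```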
4. Cluster Randomized Trials with Binary Outcomes
Chapters 1, 2, and 3 discuss cluster randomized trials with continuous outcomes.
In Chapters 4 and 5 we investigate power for a cluster randomized trial with a binary
outcome. Chapter 4 provides a brief conceptual background and chapter 5 describes how
to use the Optimal Design Software to design a cluster randomized trial with a binary
outcome.
4.1 General Description of the CRT with a Binary Outcome
The general design of a CRT with a binary outcome is the same as a CRT with a
continuous outcome: students nested within schools or, more generally, level-1 units
nested within level-2 units. However, the outcome variable is different. For example,
the outcome for a study might be whether or not a student drops out of school or whether
or not a student drinks alcohol in high school. The variable has only two possibilities so it
is a binary outcome. Because of the structure of the data, the model for a CRT with a
binary outcome is different than the model for a CRT with a continuous outcome. Let’s
take a closer look at the model.
4.2 The Model
The model for a CRT with binary outcome can be thought of as an extension of
the generalized linear model applied to a multi-level setting. The level-1 model is
comprised of three parts: the sampling model, the link function, and the structural model.
The level-1 sampling model defines the probability that the event will occur. The
sampling model is below:

Y_ij | φ_ij ~ B(m_ij, φ_ij)   (1)

for i ∈ {1, 2, …, n_j} persons per cluster and j ∈ {1, 2, …, J} clusters;
where m_ij is the number of trials for person i in cluster j; and
φ_ij is the probability of success for person i in cluster j.
The expected value and variance of Y_ij | φ_ij are:

E(Y_ij | φ_ij) = m_ij φ_ij
Var(Y_ij | φ_ij) = m_ij φ_ij(1 − φ_ij)   (2)

Note that in the case of a Bernoulli trial, m_ij = 1, so the expected value of Y_ij | φ_ij reduces
to φ_ij and the variance reduces to φ_ij(1 − φ_ij). A common link function for a binary outcome
is the logit link:
η_ij = log(φ_ij / (1 − φ_ij))   (3)
where ijη is the log odds of success.
Let’s investigate the relationship between the probability of success, the odds of
success, and the log odds of success. If the probability of success, φ_ij, is 0.50, then the
odds of success are 0.5/(1 − 0.5) = 1, and the log odds of success is log(1) = 0. If the
probability of success, φ_ij, is greater than 0.5, then the odds of success are greater than 1,
and the log odds of success is positive. If the probability of success, φ_ij, is less than 0.5,
then the odds of success are less than 1 and the log odds of success is negative.
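These relationships are easy to confirm numerically:

```python
import math

def odds(p):
    """Odds of success for probability p."""
    return p / (1 - p)

def log_odds(p):
    """Log odds (logit) of success for probability p."""
    return math.log(odds(p))

print(odds(0.5), log_odds(0.5))   # 1.0 0.0
print(log_odds(0.75) > 0)         # probability above 0.5 -> positive log odds
print(log_odds(0.25) < 0)         # probability below 0.5 -> negative log odds
```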
The third part of the level-1 model is the structural model:

η_ij = β_0j   (4)

where β_0j is the average log odds of success for cluster j.
The level-2 model is the same as the level-2 model for a CRT with a continuous
outcome. However, the interpretation of the parameters differs because of the logit link
function:

β_0j = γ_00 + γ_01 W_j + u_0j,   u_0j ~ N(0, τ)   (5)

where γ_00 is the average log odds of success across clusters;
γ_01 is the treatment effect in log odds;
W_j is ½ for treatment and −½ for control;
u_0j is the random effect associated with each cluster mean; and
τ is the between-cluster variance in log odds.
4.3 Testing the Main Effect of Treatment
The framework for testing the main effect of treatment in the case of a binary
outcome variable is very similar to the case of a continuous outcome variable. In the
model above (equation 5), the treatment effect is denoted γ_01. It is estimated by:

γ̂_01 = η̄_E − η̄_C   (6)

where η̄_E is the predicted mean for the experimental group in log odds and η̄_C is the
predicted mean for the control group in log odds. The variance of the estimated treatment
effect can be approximated by:

Var(γ̂_01) = 4(τ + σ²/n) / J   (7)
where σ² = [1/(φ_E(1 − φ_E)) + 1/(φ_C(1 − φ_C))] / 2.   (8)
The test statistic follows a non-central Z-distribution. The non-centrality
parameter is given below:

λ = γ_01 / √Var(γ̂_01) = γ_01 / √[4(τ + σ²/n)/J]   (9)
In the case of a binary outcome, we do not typically standardize the model
because the level-1 variance is heteroskedastic, which makes the meaning of the intra-
class correlation, ρ, uninformative.
4.4 Example
Suppose a team of researchers wants to determine whether a new drug prevention
program for middle school students reduces the probability that a student does drugs. The
outcome variable is whether or not the student does any drugs prior to entering high
school. Assume that the school mean probability that a student tries any type of drug
before high school is 0.4, and that the school means vary such that the lower bound is 0.1
and the upper bound is 0.6. The researchers expect that the school mean probability that a
student who participates in the program will try drugs is 0.25. Thus far the researchers
have recruited 40 total schools, 20 that will implement the new drug prevention program
and 20 that will continue with their current policies. Within each school, they have
recruited 100 students. What power do they have to detect the desired treatment effect?
Figure 1 displays the graph from the OD software.
Figure 1. Power vs. cluster size (α = 0.05, φ_E = 0.25, φ_C = 0.40, lower plausible value
= 0.10, upper plausible value = 0.60, J = 40).
From Figure 1, we can see that with a cluster size of 100, the power to detect the
treatment effect is 0.88.
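The 0.88 figure can be approximated outside the software with equations (7)–(9). One assumption in this sketch that the text does not state explicitly: the between-school variance τ is recovered from the plausible values by treating them as a 95% interval (±1.96 standard deviations) on the log-odds scale.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def binary_crt_power(phi_e, phi_c, lower, upper, n, J):
    gamma = logit(phi_e) - logit(phi_c)                  # treatment effect in log odds
    # assumed mapping: 95% plausible interval spans 2 * 1.96 sd on the logit scale
    tau = ((logit(upper) - logit(lower)) / (2 * 1.96)) ** 2
    sigma2 = (1 / (phi_e * (1 - phi_e)) + 1 / (phi_c * (1 - phi_c))) / 2   # eq. (8)
    var = 4 * (tau + sigma2 / n) / J                     # eq. (7)
    lam = abs(gamma) / math.sqrt(var)                    # eq. (9)
    return norm_cdf(lam - 1.959964)

print(round(binary_crt_power(0.25, 0.40, 0.10, 0.60, n=100, J=40), 2))  # about 0.88
```

Under that reading of the plausible interval, the hand calculation reproduces the 0.88 shown on the OD screen.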
5. Using the Optimal Design Software for Cluster Randomized Trials with Binary
Outcomes
This chapter focuses on how to use the OD software to design a cluster
randomized trial with a binary outcome. Section 5.1 provides general information for
how to use the CRT with a binary outcome module. Section 5.2 provides an example
and details regarding the options within the module.
5.1 General Information
The CRT with a binary outcome option allows the researcher to calculate the
power for the average treatment effect as a function of the cluster size (n), the number of
clusters (J), and the probability of success in the experimental condition ( Eφ ). The menu
is in the Cluster Randomized Trial module and is given below:
Power for the main effect of treatment (binary outcome)
Power vs. cluster size (n)
Power vs. number of clusters (J)
Power vs. probability of success in the experimental condition (phi(E))
5.2.1 Example
A team of researchers is investigating the effects of a new “Stay in School
Campaign.” They believe that students who participate in the program are more likely to
graduate from high school than students who do not participate in the program. The
program targets 12th grade students. The program is implemented at the school level thus
we have a nested data structure of students within schools. The outcome for the study is
whether or not a student graduates from high school in 4 years. Based on past data, the
researchers expect the probability that a student graduates from high school in 4 years to
be 0.6. The researchers are unsure how to plan the cluster randomized trial. Three
scenarios they might encounter are described in the remainder of the chapter.
5.2.2 Scenario 1
The researchers anticipate the probability that a student graduates to be 0.75 in
schools that adopt the new “Stay in School Campaign.” They have a total of 40 schools,
20 in the experimental group and 20 in the control group. How many students do they
need from each school to detect their desired treatment effect? Assume the bounds for the
school mean proportion graduating in the control schools are [0.5, 0.9].
In Scenario 1, the cluster size is unknown thus we select the power vs. cluster size
(n) option. Figure 1 displays the screen.
Figure 1. CRT with binary outcome screen.
The buttons in the toolbar are explained below.
α - specifies the significance level, or chance of a Type I error. By default, α is set at
0.05, which is a common level for most designs.
J – specifies the number of clusters. By default, J is set at 20.
Eφ - specifies the probability of success in the treatment condition. By default, Eφ is set at
0.60.
Cφ - specifies the probability of success in the control condition. By default, Cφ is set at
0.40.
PI – specifies the 95% plausible interval for φ_Cj. This is the range the researcher would
expect for the school mean probability of success for the schools in the control group.
The remaining options in the toolbar are the same as those in the CRT option for
continuous outcomes. The details can be found in Chapter 3.
Follow the steps below to answer the questions:
Step 1: Click on Power vs. cluster size (n).
Step 2: Click on J. Set J = 40.
Step 3: Click on Eφ . Set Eφ =0.75.
Step 4: Click on Cφ . Set Cφ =0.60.
Step 5: Click on PI. Set the lower bound = 0.50 and the upper bound = 0.90. Note that we
also set the x-axis from 1 to 150. The resulting graph appears in Figure 2.
Figure 2. Power vs. cluster size.
From the figure, we can see that the researchers need approximately 16 students per
cluster to achieve power = 0.80.
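Scenario 1 can be cross-checked by scanning n upward until the approximate power reaches 0.80. This is a sketch, not the OD calculation: it uses a normal approximation and an assumed mapping of the plausible interval to τ (±1.96 sd on the log-odds scale), so the crossover may land at 15–17 rather than exactly at 16.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power(phi_e, phi_c, lower, upper, n, J):
    gamma = logit(phi_e) - logit(phi_c)
    tau = ((logit(upper) - logit(lower)) / 3.92) ** 2   # plausible-interval assumption
    sigma2 = (1 / (phi_e * (1 - phi_e)) + 1 / (phi_c * (1 - phi_c))) / 2
    return norm_cdf(gamma / math.sqrt(4 * (tau + sigma2 / n) / J) - 1.96)

# smallest cluster size reaching power 0.80 with J = 40
n = next(n for n in range(2, 200)
         if power(0.75, 0.60, 0.50, 0.90, n, J=40) >= 0.80)
print(n)
```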
5.2.3 Scenario 2
The researchers anticipate the probability that a student graduates to be 0.75 in
schools that adopt the new “Stay in School Campaign.” They have a total of 50 students
per school. How many schools do they need to detect their desired treatment effect?
Assume the bounds for the school mean proportion graduating in the
control schools are [0.5, 0.9].
In Scenario 2, the number of clusters is unknown thus we select the power vs.
number of clusters (J) option.
Follow the steps below to answer the questions:
Step 1: Click on Power vs. number of clusters (J).
Step 2: Click on n. Set n = 50.
Step 3: Click on Eφ . Set Eφ =0.75.
Step 4: Click on Cφ . Set Cφ =0.60.
Step 5: Click on PI. Set the lower bound = 0.50 and the upper bound = 0.90. The
resulting graph appears in Figure 3.
Figure 3. Power vs. number of clusters.
Clicking along the trajectory reveals that 28 clusters are required to achieve power =
0.80. This would mean there would be 14 clusters assigned to treatment and 14 clusters
assigned to control.
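The same approximation can search over J instead of n. Restricting the search to even J keeps the two arms balanced, as in the text; under the same assumptions as before (normal approximation, plausible interval treated as ±1.96 sd on the log-odds scale), the search lands on 28 clusters.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power(phi_e, phi_c, lower, upper, n, J):
    gamma = logit(phi_e) - logit(phi_c)
    tau = ((logit(upper) - logit(lower)) / 3.92) ** 2   # plausible-interval assumption
    sigma2 = (1 / (phi_e * (1 - phi_e)) + 1 / (phi_c * (1 - phi_c))) / 2
    return norm_cdf(gamma / math.sqrt(4 * (tau + sigma2 / n) / J) - 1.96)

# smallest even J (so the two arms stay balanced) reaching power 0.80
J = next(J for J in range(2, 200, 2)
         if power(0.75, 0.60, 0.50, 0.90, n=50, J=J) >= 0.80)
print(J)
```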
5.2.4 Scenario 3
The researchers expect to secure 40 total schools and 50 students per school.
What is the smallest probability of success the researcher can detect for the experimental
group with power = 0.80? Assume the bounds for the school mean proportion graduating
in the control schools are [0.5, 0.9].
In Scenario 3, the probability of success in the treatment group is unknown thus
we select the power vs. probability of success in the treatment group (phi(E)) option.
Follow the steps below to answer the questions:
Step 1: Click on Power vs. probability of success in the treatment group (phi(E)).
Step 2: Click on n. Set n = 50.
Step 3: Click on J. Set J = 40.
Step 4: Click on Cφ . Set Cφ =0.60.
Step 5: Click on PI. Set the lower bound = 0.50 and the upper bound = 0.90. The
resulting graph appears in Figure 4.
Figure 4. Power vs. probability of success in the treatment group.
Note that this plot looks different from previous graphs. There is high power for
probabilities that are very different from 0.60, the probability of graduation for the
control group. In other words, big differences in probabilities for treatment and control
are easier to detect. The low power at 0.60 is also logical because if the probability of
graduation was very similar in both groups, the power to detect the difference would be
very low. In our example, we expect to increase the probability of graduation for students
in the treatment group. Thus we look to the right of 0.60 (the probability of graduation for
the control group). Clicking along the trajectory to the right of 0.60, we can see that with
power = 0.80, the researchers can detect a probability of graduation for the treatment
group equal to 0.72. In other words, if the researchers believe that the new program will
increase the probability of graduation by at least 0.12, then they have power = 0.80 for
the study.
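Scenario 3 inverts the same calculation: hold n, J, and φ_C fixed and solve for the smallest φ(E) above 0.60 with power 0.80. Under the same sketch assumptions as in the earlier scenarios, bisection lands near 0.72–0.73, consistent with the approximate value of 0.72 read off the OD trajectory.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power(phi_e, phi_c=0.60, lower=0.50, upper=0.90, n=50, J=40):
    gamma = logit(phi_e) - logit(phi_c)
    tau = ((logit(upper) - logit(lower)) / 3.92) ** 2   # plausible-interval assumption
    sigma2 = (1 / (phi_e * (1 - phi_e)) + 1 / (phi_c * (1 - phi_c))) / 2
    return norm_cdf(abs(gamma) / math.sqrt(4 * (tau + sigma2 / n) / J) - 1.96)

# bisect for the smallest phi(E) above 0.60 reaching power 0.80
lo, hi = 0.601, 0.99
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if power(mid) < 0.80 else (lo, mid)
print(round(hi, 3))
```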
6. Multi-site Cluster Randomized Trials
As discussed in Chapters 1, 2, and 3, researchers designing cluster randomized
trials are often limited in the number of clusters they can afford, resulting in studies that
lack statistical power. In Chapter 2, we investigated the effects of including a cluster-
level covariate in the design and analysis of a cluster randomized trial on the power of the
test. Chapters 6 and 7 explore another method that is commonly used to increase
statistical power in cluster randomized trials, known as blocking. In addition, we
investigate the effects of blocking and including a cluster-level covariate on the power of
the test. Chapter 6 provides a brief conceptual background, and chapter 7 describes how
to use the Optimal Design software to design a cluster randomized trial using blocking.
6.1 Why block?
Blocking is a commonly used technique in experimental design and is frequently
used in individual randomized trials. In this chapter, we extend the idea of blocking to a
cluster randomized trial and focus on the use of pre-randomization blocking to improve
the precision of the estimates and increase the power of the tests. The basic idea of pre-
randomization blocking is to find sites or blocks where clusters within the sites are very
similar with respect to the outcome variable. This reduces the heterogeneity within sites
or “blocks”, increasing the precision of the treatment effect estimate, hence increasing the
power of the test for the main effect of treatment.
To illustrate, imagine that researchers develop a new reading program for
elementary school students. We know that prior school mean test scores, ethnic
composition, and socioeconomic status are related to school mean reading achievement.
We might therefore assign the schools to “blocks” that are similar on mean prior test
scores, ethnic compositions, and mean socioeconomic status. Within each block, we
randomize schools to receive the new reading program or the regular program. This
reduces the variance in the estimate of the treatment effect because by dividing schools
into blocks we are able to remove the between-block variance from the error variance.
67
The between-block component is likely to be large, so removing it greatly reduces the
variance of the estimate.
A design using blocking before randomizing groups can be thought of as a multi-
site cluster randomized trial, an extension of the cluster randomized trial. In a multi-site
cluster randomized trial, the site is the block and clusters are randomly assigned to
treatment and control within each site. Sometimes the sites are natural administrative
units, for example, schools where classrooms are randomly assigned to treatment within
schools. The remainder of this chapter and Chapter 7 will refer to a design that utilizes
blocking as a multi-site cluster randomized trial.
Designing a multi-site cluster randomized trial requires that the researcher
calculate the power for the average treatment effect. The power to detect the main effect
of treatment in a multi-site cluster randomized trial is slightly more complicated than in a
cluster randomized trial. In a typical multi-site cluster randomized trial, power is a
function of the minimum detectable effect size, δ , the intra-class correlation, ρ , the
effect size variability, σ_δ², the number of sites, K, the number of clusters per site, J, and
the cluster size, n, while holding α constant. The effect size variability, σ_δ², is not
estimable in a cluster randomized trial because we calculate only one effect size, the
difference between the control and experimental groups. However, in a multi-site cluster
randomized trial, the experiment is replicated within sites. This allows us to estimate an
effect size for each site. Thus we are able to estimate the variance of the effect size. For
reasons discussed later in this chapter, it can also be useful to calculate the power for the
variance of the treatment effect. Power for the treatment effect variability is a function of
the intra-class correlation, ρ, the effect size variability, σ_δ², the number of sites, K, the
number of clusters, J, and the cluster size, n, while holding α constant. Note that δ, the
standardized main effect of treatment, is not a part of this power calculation. Both tests
are discussed separately following a discussion of the model for a multi-site cluster
randomized trial.
In many cases, the sites will be regarded as randomly sampled from a larger
universe or “population” of possible sites. The larger universe is the target of
generalization. For example, if schools are sampled and then classrooms are assigned at
random to treatments within schools, the target of any generalizations will often be the
larger universe of schools from which schools in the study are regarded as a
representative sample.
In other cases, the sites will be regarded as fixed. Consider a program designed to
teach students about the dangers of drugs. The outcome for the study is students’ attitude
towards drugs, which is measured by a questionnaire. The researchers hypothesize that
the school setting - suburban, urban, or rural - affects students’ attitude towards drugs.
Thus they want to block on the setting. In this case, suburban, urban, and rural are not
regarded as sampled from a population of settings, but rather as fixed blocks or sites.
Whether we view sites as fixed or random affects the data analysis and planning
for adequate power to detect the treatment effect. Sections 6.2-6.6 explain how to plan
studies in which sites are regarded as random. Section 6.7 describes how to modify these
procedures for the case in which sites are regarded as fixed.
6.2 The Random Effects Model
We can represent data from a multi-site cluster randomized trial as a three level
model, persons nested within clusters nested within sites. The level-1 model, or person-
level model is:
Y_ijk = π_0jk + e_ijk,   e_ijk ~ N(0, σ²)   (1)

for i ∈ {1, 2, …, n} persons per cluster, j ∈ {1, 2, …, J} clusters, and k ∈ {1, 2, …, K} sites,
where π_0jk is the mean for cluster j in site k;
e_ijk is the error associated with each person; and
σ² is the within-cluster variance.
The level-2 model, or cluster-level model, is:

π_0jk = β_00k + β_01k W_jk + r_0jk,   r_0jk ~ N(0, τ_π)   (2)

where β_00k is the mean for site k;
β_01k is the treatment effect at site k;
W_jk is a treatment contrast indicator, ½ for treatment and −½ for control;
r_0jk is the random effect associated with each cluster; and
τ_π is the variance between clusters within sites.
The level-3 model, or site-level model, is:

β_00k = γ_000 + u_00k,   var(u_00k) = τ_β00
β_01k = γ_010 + u_01k,   var(u_01k) = τ_β11
cov(u_00k, u_01k) = τ_β01   (3)

where γ_000 is the grand mean;
γ_010 is the average treatment effect (“main effect of treatment”);
u_00k is the random effect associated with each site mean;
u_01k is the random effect associated with each site treatment effect;
τ_β00 is the variance between site means;
τ_β11 is the variance between sites on the treatment effect; and
τ_β01 is the covariance between site-specific means and site-specific treatment
effects.
The random effects u_00k and u_01k are typically assumed bivariate normal in
distribution. We are interested in two quantities, the main effect of treatment, γ_010, and
the variance of the treatment effect, τ_β11. Note that we are operating under a random
effects model. In a fixed effects model, the variance of the treatment effect, τ_β11, would be
0. Section 6.3 focuses on the power for the main effect of treatment for a random effects
model. Section 6.6 discusses power for the treatment effect variability.
6.3 Testing the Average Treatment Effect
The average treatment effect is denoted as γ_010 in level 3 of the model. Given a
balanced design, it is estimated by

γ̂_010 = Ȳ_E − Ȳ_C   (4)

where Ȳ_E is the mean for the experimental group and Ȳ_C is the mean for the control
group.
Note that the estimated main effect of treatment looks like that in the cluster
randomized trial except that now we are summing over clusters and sites. Thus the
variance of the treatment effect is slightly different than in a cluster randomized trial. It is
estimated by (Raudenbush and Liu, 2000)

Var(γ̂_010) = [τ_β11 + 4(τ_π + σ²/n)/J] / K.   (5)

The main difference between the variance of the treatment effect in a multi-site cluster
randomized trial and that in a cluster randomized trial is that we now have four sources of
variability: the within-cluster variance, σ², the between-cluster (within-site) variance, τ_π,
the between-site variance, τ_β00, and the between-site variance in the treatment effect,
τ_β11.
If the data are balanced, we can use the results of a nested analysis of variance
with random effects for the clusters and sites and fixed effects for the treatment. Similar
to prior tests, the test statistic is an F statistic. The F test follows a non-central F
distribution, F(1, K−1; λ). Recall that the noncentrality parameter is a ratio of the
squared treatment effect to the variance of the treatment effect estimate. Below is the
noncentrality parameter for the test.

λ = γ_010² / var(γ̂_010) = K γ_010² / [τ_β11 + 4(τ_π + σ²/n)/J].   (6)
Recall that the larger the non-centrality parameter, the greater the power of the
test. By looking at the formula, we can see that K, the number of sites, has the greatest
impact on the power. It is especially important to have a large K if there is a lot of
between-site variance. Increasing J also increases the power but is not as important as K.
J becomes more important if there is a lot of variability between clusters. Finally,
increasing n does increase the power, but has the smallest effect of the three sample sizes.
Increasing n is most beneficial if there is a lot of variability within clusters. In addition to
K, J, and n, a larger effect size increases power. Note that τ_β11, the between-site variance
of the treatment effect, appears in the denominator of the non-centrality parameter. As
mentioned above, if the variance of the treatment effect across sites is large, it is
particularly important to have a large number of sites to counteract the increase in
variance in order to achieve adequate power. However, if the variability of the impact
across sites is very large, the average treatment effect may not be informative. Section 6.6
discusses the importance of the variance of the treatment effect.
Thus far, we have focused on the unstandardized random effects model for a
multi-site cluster randomized trial. However, we know that researchers often talk in terms
of standardized effect sizes and standardized effect size variability. Recall in chapter 1,
we adopted Cohen’s definition for standardized effect sizes, with 0.20, 0.50, and 0.80 as
small, medium, and large effect sizes. We continue with these same rules of thumb in this
chapter. In a multi-site cluster randomized trial, we also need to standardize the variance
of the effect size. The magnitude of the effect size variability depends on the desired
minimum detectable effect. For example, an effect size variance of 0.10 is the same as a
standard error of approximately √0.10 ≈ 0.31. If a researcher desires a minimum
detectable effect of 0.20, a standard error of 0.31 is too large and would indicate a lot of
uncertainty in the estimate. For an effect size of 0.20, an effect size variance of 0.01 (or
standard error of 0.10) is more reasonable. Let’s see how we translate the unstandardized
model into a standardized model.
In the standardized model, the within-cluster variance, σ², and the between-
cluster variance, τ_π, sum to 1. The intra-cluster correlation, ρ, is defined as
ρ = τ_π / (τ_π + σ²). Since τ_π + σ² = 1, we can rewrite τ_π = ρ and σ² = 1 − ρ. This notation is
the same as the notation for the standardized model for the cluster randomized trial. It is
important to recognize that in the multi-site cluster randomized trial we remove the
between-block variability, so ρ is the between-cluster variance relative to the total
variance within blocks. However, in the program we ask the user to specify the intra-class
correlation, standardized effect size, and effect size variability prior to blocking, as well as
the percentage of variance explained by blocking, in order to simplify the calculations for
the user.
In standardized notation, the non-centrality parameter, λ, can be rewritten as:

λ = K δ² / [σ_δ² + 4(ρ + (1 − ρ)/n)/J]   (7)

where the intra-cluster correlation, ρ, is

ρ = τ_π / (τ_π + σ²),

or the variance between clusters relative to the between- and within-cluster variation
within blocks; δ is the standardized main effect of treatment,

δ = γ_010 / √(τ_π + σ²);

and σ_δ² is the variance of the standardized treatment effect,

σ_δ² = τ_β11 / (τ_π + σ²).

It is important to be familiar with the standardized model because it is common
among researchers to use standardized values, and the Optimal Design software
operates with the standardized notation, which requires the researcher to be able to identify
ρ, δ, and σ_δ². Note that power now depends on n, J, K, ρ, δ, and σ_δ².
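Equation (7) can be evaluated directly. The sketch below also attaches a rough power figure by treating √λ as a normal deviate, which only approximates the noncentral F(1, K−1) test that OD actually uses; the input values are illustrative, not taken from the text.

```python
import math

def mscrt_lambda(delta, rho, var_delta, n, J, K):
    """Noncentrality for the main effect of treatment, equation (7)."""
    return K * delta ** 2 / (var_delta + 4 * (rho + (1 - rho) / n) / J)

lam = mscrt_lambda(delta=0.30, rho=0.05, var_delta=0.01, n=20, J=4, K=15)

# crude power: treat sqrt(lambda) as a normal deviate (OD uses the noncentral F)
approx_power = 0.5 * (1 + math.erf((math.sqrt(lam) - 1.96) / math.sqrt(2)))
print(round(lam, 1), round(approx_power, 2))
```

Doubling K doubles λ directly, while extra clusters or extra persons only shrink the second term in the denominator, which mirrors the ordering K > J > n described above.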
δσ
73
6.4 Including a Cluster-level Covariate
In addition to blocking, researchers may also have cluster-level covariates
available. The cluster-level covariate in a multi-site randomized trial functions similarly
to the cluster-level covariate in a cluster randomized trial discussed in Chapter 2. Recall
that including a cluster-level covariate influences the power of the test depending on the
strength of the correlation between the covariate and the true cluster mean outcome, or
how much of the variability in the true cluster mean outcome is explained by the
covariate. The proportion of explained variability is denoted R². The larger R², the
smaller the conditional level-2 variance, τ_π|x, relative to the unconditional level-2
variance, τ_π, and the greater the benefit of the covariate in increasing precision and
power.
6.5 The Models and Treatment Effect Estimates
The level 1 model for a multi-site cluster randomized trial with a cluster-level
covariate looks the same as the level-1 model for a regular multi-site cluster randomized
trial (see equation 1). The level 2 model looks different because it includes the cluster
level covariate. It is written as:
π_0jk = β_00k + β_01k W_jk + β_02k X_jk + r_0jk,   r_0jk ~ N(0, τ_π|x)   (8)

Note: τ_π|x = (1 − R²)τ_π

where β_00k is the adjusted mean for site k;
β_01k is the adjusted treatment effect at site k;
β_02k is the regression coefficient for the cluster-level covariate at site k;
W_jk is 0.5 for treatment and −0.5 for control;
X_jk is the cluster-level covariate, typically centered to have mean 0;
r_0jk is the random effect associated with each cluster; and
τ_π|x is the residual variance conditional on the cluster-level covariate X_jk.

Note that the between-cluster variance is now a residual variance conditional on the
cluster-level covariate X_jk.
The level-3 model is now:

β_00k = γ_000 + u_00k,   u_00k ~ N(0, τ_β00|x)   (9)
β_01k = γ_010 + u_01k,   u_01k ~ N(0, τ_β11)
β_02k = γ_020

where γ_000 is the grand mean;
γ_010 is the average treatment effect (“main effect of treatment”);
γ_020 is the regression coefficient for the cluster-level covariate, which is assumed
constant across sites;
u_00k is the random effect associated with each site mean;
u_01k is the random effect associated with each site treatment effect;
τ_β00|x is the residual variance between site means; and
τ_β11 is the variance between sites on the treatment effect.
Because of the randomization, the true treatment effect is not influenced by the
covariate. Thus it is not necessary to have a conditional variance for the between-site
variation in the treatment effect. Note that we are also fixing the average regression
coefficient for the cluster-level covariate.
The estimate of the main effect of the treatment accounting for the cluster-level
covariate is:

γ̂_010 = (Ȳ_E − Ȳ_C) − γ̂_020(X̄_E − X̄_C).   (10)

In words, it is the mean difference adjusted for the treatment group differences on the
covariate. To test the main effect of treatment we use an F statistic, which follows a non-
central F distribution, F(1, K−2; λ_x), where:

λ_x = K γ_010² / [τ_β11 + 4(τ_π|x + σ²/n)/J]   (11)
This formula for the noncentrality parameter looks similar to the noncentrality parameter
without the covariate except that the estimate of the treatment effect is calculated
differently and the between-cluster variance is now a conditional variance.
Following the same logic as the multi-site cluster randomized trial with no
covariate, it is important to standardize the model. The non-centrality parameter
expressed in standardized notation is:

λ_x = K δ*² / [σ_δ*² + 4(ρ* + (1 − ρ*)/n)/J]   (12)

where the intra-cluster correlation, ρ*, is

ρ* = τ_π|x / (τ_π|x + σ²),

or the conditional variance between clusters relative to the between- and within-cluster
variation within blocks; δ* is the standardized main effect of treatment conditional on
the covariate,

δ* = γ_010 / √(τ_π|x + σ²);

and σ_δ*² is the variance of the standardized treatment effect conditional on the covariate,

σ_δ*² = τ_β11 / (τ_π|x + σ²).
Because the conditional standardized quantities resulting from inclusion of a
covariate are frequently unknown, the program asks the user to enter the unconditional
parameters, ρ, δ, and σ²_δ. The program calculates the conditional standardized values
based on the input.
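To make this calculation concrete, here is a minimal Python sketch (scipy assumed) of the power computation based on equation (12). The function name, the K − 1 denominator degrees of freedom without a covariate, and the way the conditional parameters are derived from the unconditional inputs are our own illustrative assumptions, not the Optimal Design internals:

```python
from scipy.stats import f, ncf

def mscrt_power(K, J, n, delta, rho, sigma2_delta, r2_l2=0.0, alpha=0.05):
    """Power for the average treatment effect in a multi-site cluster
    randomized trial with random site effects (equation 12).

    delta, rho, sigma2_delta are the unconditional standardized inputs;
    r2_l2 is the proportion of between-cluster variance explained by the
    cluster-level covariate (0 gives the no-covariate case).
    """
    # Assumed adjustment: the covariate reduces only the between-cluster
    # variance, rho -> (1 - r2_l2) * rho, then quantities are re-standardized.
    tau = (1.0 - r2_l2) * rho        # conditional between-cluster variance
    total = tau + (1.0 - rho)        # conditional total variance
    rho_c = tau / total              # rho* in the text
    delta_c = delta / total ** 0.5   # delta* in the text
    s2d_c = sigma2_delta / total     # sigma^2_delta* in the text
    # Noncentrality parameter, equation (12)
    lam = K * delta_c ** 2 / (s2d_c + 4.0 * (rho_c + (1.0 - rho_c) / n) / J)
    # Denominator df: K - 2 with the covariate (per the text); K - 1 without
    # the covariate (our assumption)
    df2 = K - 2 if r2_l2 > 0 else K - 1
    return ncf.sf(f.ppf(1.0 - alpha, 1, df2), 1, df2, lam)
```

Comparing `mscrt_power(10, 10, 50, 0.20, 0.15, 0.01)` with the same call plus `r2_l2=0.49` shows the gain in power from the cluster-level covariate.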
6.6 Testing the Variance of the Treatment Effect
Recall that in a fixed effects model we assume the treatment effect to be
homogeneous across the sites. Thus the tests described in this section are only applicable
under the random effects model where we assume the treatment effect differs across the
sites. To quantify this difference, we estimate the variance of the treatment effect across
the sites. The design, with treatments randomized to clusters within sites, allows us to
estimate this variability. If it is very large, it may be hiding the true treatment effect. For
example, imagine a multi-site cluster randomized trial that reports a standardized
treatment effect estimate of 0.23. The researchers claim that the new reading program
improves scores by 0.23 standard units. However, they fail to report that the standardized
treatment effect variability across sites is 0.30. The high variance may be hiding the fact
that some types of schools benefit from the program while other types of schools actually
suffer from the program. For example, there may be a differential effect by location,
where rural schools that adopt the program see positive effects but urban schools that
adopt the program see negative effects. Thus the researchers would need to investigate
moderating site characteristics. Reporting the average treatment effect alone may be very
misleading and is not recommended.
Because the variance of the treatment effect is critical in determining the
interpretation of a treatment effect estimate, it is important to be able to detect the
treatment effect variability with adequate power. The remainder of this section describes
how to calculate the power for the variance of the treatment effect. We will use the
standardized model notation since it is more common in practice and is required for the
Optimal Design software.
The null and alternative tests for the treatment effect variability are:
H_0: σ²_δ = 0
H_1: σ²_δ > 0.
The null hypothesis states that the variance of the treatment impact across sites is null,
whereas the alternative hypothesis states that it is greater than 0. The test for the variance
of the treatment effect is an F test. The F statistic is:
F = {σ̂²_δ + 4[ρ̂ + (1 − ρ̂)/n]/J} / {4[ρ̂ + (1 − ρ̂)/n]/J}.   (13)
Note that the average effect size is not a part of the calculation; thus the power is based
on the number of sites, K, the number of clusters per site, J, the number of people per
cluster, n, the standardized effect size variability, σ²_δ, and the intra-cluster correlation, ρ.
The F statistic follows a central F distribution with df = K − 1, K(J − 2). The ratio of the
expectation of the numerator to the expectation of the denominator is

ω = 1 + Jσ²_δ / {4[ρ + (1 − ρ)/n]}.   (14)
Under the null hypothesis, we expect σ²_δ to be 0, thus ω = 1. As σ²_δ increases, ω gets
larger, which increases the power of the test. Thus the number of clusters within each site
is critical for increasing the power to detect the variance of the treatment effect across
sites. As the number of clusters within each site increases, so does the power to detect the
variability of treatment effects. Increasing K also increases the power, through the
degrees of freedom, but is not as important as increasing J. Note that this is the opposite
of what we found in the case of power for the treatment effect, where K is the most
significant factor in increasing power and J is less important.
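The reasoning above can be sketched in code. Under H_1 the F statistic in equation (13) behaves like ω times a central F, so power is the probability that a central F(K − 1, K(J − 2)) exceeds F_crit/ω. A minimal sketch (scipy assumed; the function name is ours):

```python
from scipy.stats import f

def variance_test_power(K, J, n, sigma2_delta, rho, alpha=0.05):
    """Power to detect treatment effect variability (Section 6.6).

    Approximation (assumed): under H1, the statistic in equation (13) is
    distributed as omega times a central F with K-1 and K(J-2) df, where
    omega is the expectation ratio in equation (14).
    """
    omega = 1.0 + J * sigma2_delta / (4.0 * (rho + (1.0 - rho) / n))
    df1, df2 = K - 1, K * (J - 2)
    f_crit = f.ppf(1.0 - alpha, df1, df2)
    # Reject when F > f_crit, i.e. the central-F part exceeds f_crit / omega
    return f.sf(f_crit / omega, df1, df2)
```

Because ω grows linearly in J but K enters only through the degrees of freedom, the sketch reproduces the point above: adding clusters within sites raises this power faster than adding sites.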
Looking at equation 14, we can see that it will be difficult to achieve adequate
power to detect small values of σ²_δ, like 0.01, unless J is extremely large, which is
unlikely. This is not very problematic, however, because our primary concern is to be able
to detect larger treatment effect variability, since small values will not influence the
interpretation of the treatment effect.
6.7.1 Example 1
To illustrate the use of blocking consider the two examples below. Researchers
have developed a new drug prevention program and are eager to test the effectiveness of
the program. The researchers decide to block on location. They choose 10 large
metropolitan cities across the United States from a population of large cities, and within
each city randomly assign schools to either the new drug prevention program or the
regular program. Assume 10 schools participate within each site and 50 students within
each school. Assume that the intra-class correlation before blocking, ρ = 0.15, and the
effect size variability, σ²_δ = 0.01. Blocking accounts for 50% of the variation in the
outcome variable. The researchers want to be able to detect a minimum treatment effect
of 0.20. What power do the researchers have to detect this treatment effect? Assume that
the researchers also have access to a cluster-level covariate that explains 49% of the
variability in cluster-mean outcomes after accounting for blocking. How does this change
the power to detect the treatment effect? Ignoring the covariate, what power do the
researchers have to detect the variance of the treatment effect?
Below is a plot from the Optimal Design software with the specifications listed in
the example with no cluster-level covariate. Note that the number of sites varies along
the x-axis, with the power plotted as a function of the number of sites.
Figure 1. Power vs. Number of Sites (Power for Treatment Effect). [Power curve; x-axis:
number of sites, 8-20; legend: α = 0.050, n = 50, J = 10; δ = 0.20, ρ = 0.15, σ²_δ = 0.010,
B = 0.50.]
First, note that the legend in the upper right corner matches the specifications set forth in
the example. Clicking along the trajectory at K=10 sites reveals that the power is
approximately 0.75. Let’s see what happens when we include a cluster-level covariate
that explains 49% of the variation in the cluster-level outcome. Figure 2 displays the
trajectory with and without the cluster-level covariate.
Figure 2. Power vs. Number of Sites (Power for Treatment Effect). [Two power curves;
x-axis: number of sites, 8-20; legend: α = 0.050, n = 50, J = 10; solid: δ = 0.20, ρ = 0.15,
σ²_δ = 0.010, B = 0.50; dotted: same specifications with R²_L2 = 0.49.]
Clicking along the dotted trajectory reveals that for K=10 the power is approximately
0.87. Thus we can conclude that including the cluster-level covariate increases the power
to an adequate level to detect the main effect of treatment.
We can also plot the power for the variance of the treatment effect. The plot does
not include the cluster-level covariate. Figure 3 displays the plot for the variance of the
treatment effect.
Figure 3. Power vs. Number of Sites (Power for Effect Size Variability). [Two power
curves; x-axis: number of sites, 7-19; legend: α = 0.050, n = 50, J = 10; solid: ρ = 0.15,
σ²_δ = 0.010, B = 0.50; dotted: ρ = 0.15, σ²_δ = 0.100, B = 0.50.]
Clicking along the solid trajectory reveals that for 10 sites, the power to detect the
treatment effect variability is 0.14. Note that the very small effect size variability, 0.01,
makes it difficult to detect. The second trajectory on the graph sets the effect size
variability equal to 0.10. The remaining constraints are the same, but the power to detect
the treatment effect variability is 0.83. In other words, as the size of the treatment effect
variability
increases, the power to detect it increases dramatically.
6.7.2 Example 2
The second example compares a cluster randomized trial to a multi-site cluster
randomized trial with respect to the power to detect the treatment effect. Imagine a team
of researchers develops a new math program for 4th graders. They propose that students
who participate in the new program will have increased math achievement. The outcome
is math score on a specific math test at the completion of 4th grade. They propose two
designs and want to know which design will give them the most power to detect the
treatment effect.
Design 1 – Cluster Randomized Trial: The first design is a cluster randomized
trial. They plan to randomly assign 40 schools to either treatment or control. Within each
school, they plan to test 100 students. Based on past research, the researchers estimate an
intra-class correlation of 0.10. They want to be able to detect a minimum effect size of
0.25. What is the power of the test under this design?
Design 2 – Multi-Site Cluster Randomized Trial: The second design is a multi-site
cluster randomized trial. Based on past studies, the researchers know that the percent of
children in a school on free and reduced lunch is strongly related to achievement. The
researchers obtain 10 sites, blocked on the percent of students on free and reduced lunch.
Within each site they randomly assign 2 schools to treatment and 2 schools to control.
They still test 100 students within each school. Research indicates that blocking on
percent of children on free and reduced lunch reduces the between-school variation by
64%. Assuming the variability in effect sizes is small, 0.01, and the intra-class
correlation, ρ = 0.10, what is the power to detect a treatment effect of 0.25?
Design 1: Plugging the information into the Cluster Randomized Trial option, we
get the graph in Figure 4.
Figure 4. Power vs. number of clusters. [Power curve; x-axis: number of clusters, 23-99;
legend: α = 0.050, n = 100; δ = 0.25, ρ = 0.10.]
Clicking on the trajectory at J=40 reveals that the power to detect an effect is 0.64. This
is not a very powerful design so let’s see how design 2 compares.
Design 2. Plugging the information into the Multi-Site Cluster Randomized Trial
option, we get the plot in Figure 5.
Figure 5. Power vs. number of sites (Power for Treatment Effect). [Power curve; x-axis:
number of sites, 8-20; legend: α = 0.050, n = 100, J = 4; δ = 0.25, ρ = 0.10, σ²_δ = 0.010,
B = 0.64.]
We can see that the larger the number of sites, the greater the power. Clicking along the
trajectory we can see that for 10 sites, the power is approximately 0.86. Note that 10 sites
with 4 schools at each site results in a total of 40 schools. Thus we use 40 schools in both
designs but we are able to increase the power from 0.64 to 0.86 by blocking on percent of
students on free and reduced lunch.
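The comparison between the two designs can be scripted in Python. The blocking adjustment below (B removing that fraction of the between-cluster variance before the standardized quantities are recomputed) is our own reading of the software's inputs, not its documented internals, and the function names are ours:

```python
from scipy.stats import f, ncf

def crt_power(J, n, delta, rho, alpha=0.05):
    """Power for a two-arm cluster randomized trial with J clusters total,
    using F(1, J-2; lambda) with lambda = J*delta^2 / (4*[rho + (1-rho)/n])."""
    lam = J * delta ** 2 / (4.0 * (rho + (1.0 - rho) / n))
    return ncf.sf(f.ppf(1.0 - alpha, 1, J - 2), 1, J - 2, lam)

def mscrt_power_blocked(K, J, n, delta, rho, sigma2_delta, B=0.0, alpha=0.05):
    """Power for a multi-site cluster randomized trial, random site effects.
    B is read as the fraction of the between-cluster variance removed by
    blocking (our assumption about the software's B input)."""
    tau = (1.0 - B) * rho            # within-block between-cluster variance
    total = tau + (1.0 - rho)        # within-block total variance
    rho_b, delta_b = tau / total, delta / total ** 0.5
    s2d_b = sigma2_delta / total
    lam = K * delta_b ** 2 / (s2d_b + 4.0 * (rho_b + (1.0 - rho_b) / n) / J)
    return ncf.sf(f.ppf(1.0 - alpha, 1, K - 1), 1, K - 1, lam)

# Design 1: 40 schools, 100 students each (the text reports power of 0.64)
print(crt_power(40, 100, 0.25, 0.10))
# Design 2: 10 sites x 4 schools, B = 0.64 (the text reports power of 0.86)
print(mscrt_power_blocked(10, 4, 100, 0.25, 0.10, 0.01, B=0.64))
```

Under these assumptions the sketch reproduces the advantage of blocking with the same total of 40 schools.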
6.8 The Fixed Effects Model
Recall that we can represent data from a multi-site cluster randomized trial as a
three level model, persons nested within clusters nested within sites. The fixed effects
model is identical to the random effects model with a crucial exception: the site-specific
contributions u_00k and u_01k are designated as fixed constants rather than random variables.
The level-1 and level-2 models are identical to models (1) and (2) in the random
effects case. The level-3 model, or site-level model is:
β_00k = γ_000 + u_00k
β_01k = γ_010 + u_01k   (15)
where 000γ is the grand mean;
010γ is the average treatment effect (“main effect of treatment”);
u_00k, for k ∈ {1, 2, …, K}, are fixed effects associated with each site mean,
constrained to have a mean of zero; and
u_01k, for k ∈ {1, 2, …, K}, are fixed effects associated with each site treatment
effect, constrained to have a mean of zero.
We are interested in two kinds of quantities: the main effect of treatment, γ_010, and
the fixed treatment-by-site interaction effects u_01k, for k ∈ {1, 2, …, K}.
6.9 Testing the Average Treatment Effect
If the data are balanced, we can use the results of a nested analysis of variance
with random effects for the clusters and fixed effects for sites, treatments, and site-by-
treatment interaction. Similar to prior tests, the test statistic is an F statistic. The F test
follows a non-central F distribution, F(1, K(J − 2); λ). Recall that the noncentrality
parameter is a ratio of the squared-treatment effect to the variance of the treatment effect
estimate. Below is the noncentrality parameter for the test.
λ = KJγ²_010 / [4(τ_π + σ²/n)].   (16)
Recall that the larger the non-centrality parameter, the greater the power of the
test. By looking at the formula, we can see that KJ, the total number of clusters, has the
greatest impact on the power. Finally, increasing n does increase the power, but has the
smallest effect of the three sample sizes. Increasing n is most beneficial if there is a lot of
variability within clusters. In addition to K, J, and n, a larger effect size increases power.
Note that unlike the case of the random effects model, τ_{β11}, the variance of the treatment
effect, does not appear in the denominator of the non-centrality parameter. However, if
the variation of the treatment effects across sites is large, the average treatment effect
may not be informative. Section 6.11 discusses the test of site-by-treatment effect
variation under the fixed effects model. If the treatment effects vary across sites
in a fixed effects model, the main effect of treatment must be interpreted with great caution.
In the fixed effects standardized model, the within-cluster variance, σ², and the
between-cluster variance, τ_π, sum to 1. The intra-cluster correlation, ρ, is defined as
ρ = τ_π/(τ_π + σ²). Since τ_π + σ² = 1, we can rewrite τ_π = ρ and σ² = 1 − ρ. This notation is
the same as the notation for the standardized model for the cluster randomized trial. Note
that ρ is the between-cluster variance relative to the total variance within blocks. The
non-centrality parameter, λ, can be rewritten in terms of the standardized model:

λ = KJδ² / {4[ρ + (1 − ρ)/n]}   (17)

where ρ is the intra-cluster correlation,

ρ = τ_π / (τ_π + σ²),

or the variance between clusters relative to the between- and within-cluster variation
within blocks; and δ is the standardized main effect of treatment,

δ = γ_010 / √(τ_π + σ²).
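Equation (17) translates into a short power routine (scipy assumed; the function name is ours, and δ and ρ are the within-block standardized quantities defined above):

```python
from scipy.stats import f, ncf

def fixed_effects_power(K, J, n, delta, rho, alpha=0.05):
    """Power for the average treatment effect under the fixed site effects
    model: F(1, K(J-2); lambda) with lambda from equation (17)."""
    # Noncentrality parameter, equation (17)
    lam = K * J * delta ** 2 / (4.0 * (rho + (1.0 - rho) / n))
    df2 = K * (J - 2)
    return ncf.sf(f.ppf(1.0 - alpha, 1, df2), 1, df2, lam)
```

Because λ grows with the product KJ, it is the total number of clusters that drives this power, as the discussion of equation (16) notes.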
6.10 Example
Let us compare the power to detect the treatment effect for a cluster randomized
trial to a multi-site cluster randomized trial with fixed effects. Consider a program
designed to teach students about the dangers of drugs. The researchers propose that
students who participate in the program will have a more positive attitude towards
staying away from drugs. The outcome is students’ attitudes towards drugs, which is
measured on a continuous scale. The researchers propose two designs and want to know
which design will give them the most power to detect the treatment effect.
Design 1 – Cluster Randomized Trial: The first design is a cluster randomized
trial. They plan to randomly assign schools to either treatment or control. Within each
school, they plan to test 100 students. Based on past research, the researchers estimate an
intra-class correlation of 0.15. They want to be able to detect a minimum effect size of
0.25. How many schools are necessary to achieve power = 0.80?
Design 2 – Multi-Site Cluster Randomized Trial: The second design is a multi-site
cluster randomized trial. Based on past studies, the researchers know that the school
setting is strongly related to attitude towards drugs. Research indicates that blocking on
school setting, suburban, urban, or rural, reduces the between-school variation by 33%.
Within each school, the researchers test 100 students. How many schools are necessary
for each of the three sites if the researchers want to detect a minimum treatment effect of
0.25 with power = 0.80? Assume ρ = 0.15.
Design 1: Plugging the information into the Cluster Randomized Trial option, we
get Figure 6.
Figure 6. Power vs. number of clusters. [Power curve; x-axis: number of clusters, 23-99;
legend: α = 0.050, n = 100; δ = 0.25, ρ = 0.15.]
In order to achieve power = 0.80, the researchers need to randomize a total of
approximately 82 schools.
Design 2: Figure 7 displays the power curve for the fixed effects multi-site cluster
randomized trial.
Figure 7. Power vs. number of clusters per site (Power for Treatment Effect). [Power
curve; x-axis: number of clusters/site, 14-38; legend: α = 0.050, n = 100, K = 3; δ = 0.25,
ρ = 0.15, σ²_δ = 0.000, B = 0.33. Curves with σ²_δ set to 0.0 are curves for the fixed
effects model.]
Clicking along the power curve we can see that approximately 19 schools per site are
required to achieve power = 0.80. This is a total of 19*3, or 57 schools. By blocking, we
reduce the number of schools required to achieve power = 0.80 by 25 (82-57).
While the fixed effects model affords extra power for testing the main effect of
treatment, the interpretation of such a main effect requires great caution when treatment
effects vary across sites. We now turn to the question of testing site-by-treatment
variation within the fixed effects model.
6.11 Testing Site-by-Treatment Variation in the Context of a Fixed Effects Model.
Operationally, the test of the site-by-treatment variation in the case of the fixed
effects model is identical to that in the case of the random effects model (see Section 6.6
“Testing the Variance of the Treatment Effect”). The null hypothesis, however, differs.
Recall that in the case of the random effects model we test
H_0: τ_{β11} = 0

or, for the standardized random effects model, we test

H_0: σ²_δ = 0.
However, in the fixed effect model, the site-specific treatment effects are fixed
constants rather than random variables. Thus we have, in the non-standardized model,

H_0: Σ_{k=1}^{K} u²_01k = 0.
As in the random effects case, we test this hypothesis using
F[K − 1, K(J − 2)] = MS(site by treatment) / MS(within cell).

When the F test indicates rejection of H_0, one emphasizes the estimation of site-
specific treatment effects (also known as "simple main effects"; see Kirk (1982), p. 365)
or post hoc procedures designed to identify subsets of sites for which the treatment effect
is homogeneous (see Kirk (1982), p. 317).
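Given the two mean squares from the nested ANOVA, this test is straightforward to carry out; a small sketch (scipy assumed; the function name and return shape are ours):

```python
from scipy.stats import f

def site_by_treatment_test(ms_interaction, ms_within, K, J, alpha=0.05):
    """F test of site-by-treatment variation in the fixed effects model:
    F[K-1, K(J-2)] = MS(site by treatment) / MS(within cell).
    Returns the F statistic, its p-value, and whether H0 is rejected."""
    F_stat = ms_interaction / ms_within
    p_value = f.sf(F_stat, K - 1, K * (J - 2))
    return F_stat, p_value, p_value < alpha
```

For example, with MS(site by treatment) = 2.5, MS(within cell) = 1.0, K = 10 sites, and J = 10 clusters per site, the call `site_by_treatment_test(2.5, 1.0, 10, 10)` returns the observed F of 2.5 along with its p-value on (9, 80) degrees of freedom.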
7. Using the Optimal Design Software for Multi-site Cluster Randomized Trials
This chapter focuses on how to use the OD software to design a multi-site cluster
randomized trial. Section 7.1 provides general information for how to use the multi-site
cluster randomized trial software. Sections 7.2 and 7.3 provide examples and details
about the options within the multi-site cluster randomized trials option for the case of
random site effects. Section 7.4 provides an example in the case of fixed site effects.
7.1 General Information
The multi-site cluster randomized trials option allows the researcher to calculate
the power for the average treatment effect, the effect size variability, and the cluster
effect. Below is the menu:
Multisite Cluster Randomized Trials
Power for Treatment Effect
Power vs. cluster size (n)
Power vs. number of sites (K)
Power vs. clusters per site (J)
Power vs. intra-class correlation (rho)
Power vs. effect size (delta)
Power vs. effect size variability (sigma)
Power vs. proportion of explained variation by level 2 covariate (R2)
Power for Effect Size Variability
Power vs. cluster size (n)
Power vs. number of sites (K)
Power vs. clusters per site (J)
Power vs. intra-class correlation (rho)
Power vs. effect size variability (sigma)
7.2 Power for the Average Treatment Effect (Random Effects Model)
The first option is power for the treatment effect. We know that the power for the
treatment effect in a random effects model is a function of the cluster size, n, the number
of clusters, J, the number of sites, K, the intra-class correlation, ρ , and the effect size
variability, , the effect size, 2δσ δ , and the proportion of explained variation by level 2
covariate, denoted in the program. Thus we can calculate the power as a function of
any one of these components while holding the others constant. Section 7.2.1 provides an
example that will be used to illustrate how to use the software to calculate the power for
the treatment effect in a random effects model.
22LR
7.2.1 Example
The example below was introduced in Chapter 2 and is modified in this chapter.
Suppose a team of researchers develops a new literacy program. The founders of the new
program propose that students who participate in the program will have increased reading
achievement. They propose a three-level design with students nested within classrooms
within schools. In other words, they want to block on school. By blocking on school, the
researchers expect to explain 40% of the variation in the outcome variable. They plan to
test students who are in classrooms that participate in the new program (experimental
group) and students who are in classrooms that participate in the regular program (control
group) in each of the schools using a reading test to determine if students using the new
program score higher. However, they are unsure how to proceed with respect to the
number of students they should test in each classroom, the number of classrooms in each
school, and the number of schools in order to conduct a trial with power = 0.80. Five
scenarios the researcher might encounter are presented below. Assume 05.0=α for each
case.
7.2.2 Scenario 1
Based on past studies, the researchers estimate ρ = 0.15 and σ²_δ = 0.01, and want to
be able to detect a minimum standardized effect size of 0.20. Assuming 15 schools are
willing to participate as well as 10 classrooms within each school, how many students
within each classroom are necessary to achieve power = 0.80? What if the researchers
include a cluster-level covariate that explains 49% of the variation in the cluster-level
mean? How many children per classroom are necessary to achieve power = 0.80?
In Scenario 1, the cluster size is unknown thus we select the power vs. cluster size
(n) option. Figure 1 displays the screen.
Figure 1. Multi-Site Cluster Randomized Trial Screen.
Let’s take a closer look at the function of each of the buttons on the toolbar.
α - specifies the significance level, or chance of a Type I error. By default, α is set at
0.05, which is a common level for most designs.
K – specifies the number of sites. By default, K is set at 12.
J – specifies the number of clusters within each site. By default, J is set at 10.
δ – specifies the minimum effect size of interest. Note this is the expected effect size
before blocking. By default, the minimum effect size is set at 0.20.
σ²_δ – specifies the effect size variability. By default, it is set at 0.01.
ρ – specifies the intra-class correlation before blocking. Thus it is the between-cluster
plus between-block variance divided by the within-cluster plus between-cluster plus
between-block variance. By default, it is set at 0.01 and 0.10. Trajectories for both intra-
class correlations are plotted so they can be compared.
B – specifies the proportion of variance explained by the blocking variable. By default, it
is set to 0.
R²_L2 – specifies the proportion of variation in the cluster-mean outcome explained by
the cluster-level covariate. By default, it is set to 0.
The remaining options on the toolbar are the same as those in the Cluster Randomized
Trial option, which are explained in detail in Chapter 3.
Now let’s use the software to explore the question in Scenario 1. To answer the
question, click on Power vs. cluster size (n). Then move along the toolbar and specify
K=15, J=10, δ = 0.20, σ²_δ = 0.01, ρ = 0.15, B = 0.40, and R²_L2 = 0.00 and 0.49. This
will allow us to see the plot with and without the cluster-level covariate. Figure 2 displays
the screen that appears.
Figure 2. MSCRT - Power vs. cluster size.
92
Clicking on the solid trajectory reveals that with 17 students per classroom, the power is
0.80. However, if we include the cluster-level covariate, we only need 9 students per
classroom to achieve power = 0.80.
Note that the power does not approach 1.0 for either of these trajectories: increasing the
sample size per cluster alone cannot drive the power to 1.0 when the numbers of sites and
clusters are held fixed. Additional effect
sizes, effect size variability, and intra-class correlations could also be specified to look at
a variety of trajectories on one screen.
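The search for the smallest cluster size in Scenario 1 can also be scripted. The way B and R²_L2 rescale the unconditional inputs below is our own reading of the software's conventions (a sketch, not the Optimal Design implementation), as is the choice of denominator degrees of freedom without a covariate:

```python
from scipy.stats import f, ncf

def power_trt(K, J, n, delta, rho, sigma2_delta, B=0.0, r2_l2=0.0, alpha=0.05):
    """Treatment-effect power with blocking (B) and a cluster-level covariate
    (r2_l2); the rescaling of the unconditional inputs is our assumption."""
    tau = rho * (1.0 - B) * (1.0 - r2_l2)  # residual between-cluster variance
    total = tau + (1.0 - rho)              # conditional total variance
    rho_c, delta_c = tau / total, delta / total ** 0.5
    s2d_c = sigma2_delta / total
    lam = K * delta_c ** 2 / (s2d_c + 4.0 * (rho_c + (1.0 - rho_c) / n) / J)
    df2 = K - 2 if r2_l2 > 0 else K - 1    # df without the covariate assumed
    return ncf.sf(f.ppf(1.0 - alpha, 1, df2), 1, df2, lam)

def min_n(target=0.80, **kw):
    """Smallest cluster size reaching the target power (capped at 1000)."""
    for n in range(2, 1001):
        if power_trt(n=n, **kw) >= target:
            return n
    return None

# Scenario 1: the point-and-click reading in the text is 17 students without
# the covariate and 9 with it; this sketch should land close to those values.
print(min_n(K=15, J=10, delta=0.20, rho=0.15, sigma2_delta=0.01, B=0.40))
print(min_n(K=15, J=10, delta=0.20, rho=0.15, sigma2_delta=0.01, B=0.40,
            r2_l2=0.49))
```

The same search loop applies to any of the scenarios by swapping which parameter varies.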
7.2.3 Scenario 2
Based on past studies, the researchers estimate ρ = 0.15 and σ²_δ = 0.01, and want to
be able to detect a minimum standardized effect size of 0.20. Assuming 10 classrooms
within each school are willing to participate and 25 students within each classroom, how
many schools are necessary to achieve power = 0.80? What if the researchers include a
cluster-level covariate that explains 49% of the variation in the cluster-level mean? How
many schools are necessary to achieve power = 0.80?
In Scenario 2, the number of sites is unknown thus we select the power vs.
number of sites (K) option. This will allow the number of sites to vary along the x-axis.
As a result, K will be replaced on the toolbar by the cluster size (n) icon. The remaining
options function as previously described. Now let’s use the software to revisit Scenario 2.
To answer the question, click on Power vs. number of sites (K). Then move along
the toolbar and specify n=25, J=10, δ = 0.20, σ²_δ = 0.01, ρ = 0.15, B = 0.40, and
R²_L2 = 0.00 and 0.49. Figure 3 displays the screen that appears.
Figure 3. MSCRT - Power vs. number of sites.
Clicking along the solid trajectory reveals that 14 sites, or schools, are necessary to
achieve power = 0.80. When the cluster level covariate is included in the design, clicking
along the dotted line trajectory reveals that only 10 schools are necessary to achieve
power = 0.80. Note that unlike the cluster size, as the number of sites increases, the
power tends towards 1. In other words, the number of sites is very important for
increasing the power. This corresponds to the information presented in Chapter 5. Note
that without the covariate, 14 x 10 = 140 classrooms are necessary and with the covariate
10 x 10 = 100 classrooms are necessary to achieve power = 0.80.
7.2.4 Scenario 3
Based on past studies, the researchers estimate ρ = 0.15 and σ²_δ = 0.01, and want to
be able to detect a minimum standardized effect size of 0.20. Assuming 15 schools are
willing to participate as well as 25 students per classroom, how many classrooms are
necessary to achieve power = 0.80? What if the researchers include a cluster-level
covariate that explains 49% of the variation in the cluster-level mean? How many
classrooms are necessary to achieve power = 0.80?
In Scenario 3, the number of clusters, or classrooms, is unknown thus we select
the power vs. number of clusters (J) option. Again, the only change in the toolbar is that J
no longer appears since it is allowed to vary along the x-axis. Now let’s explore Scenario
3 using the software.
To answer the question, select Power vs. number of clusters (J) from the menu.
Then move along the toolbar and specify n=25, K=15, δ = 0.20, σ²_δ = 0.01, ρ = 0.15,
B = 0.40, and R²_L2 = 0.00 and 0.49. Figure 4 displays the new screen with the x-axis
adjusted to min = 2.0 and max = 20.0.
Figure 4. MSCRT - Power vs. number of clusters per site.
Clicking along the solid trajectory reveals that 9 classrooms per school will achieve the
desired power of 0.80. When the cluster-level covariate is included, only 6 classrooms
per school are necessary. Thus in the case of no covariate, a total of K x J, or 15 x 9 =
135 classrooms are necessary to achieve the desired power. A total of 15 x 6 = 90
classrooms are necessary when the cluster level covariate is included.
7.2.5 Scenario 4
Based on past studies, the researchers estimate σ²_δ = 0.01 and want to be able to
detect a minimum standardized effect size of 0.20. Assuming 15 schools are willing to
participate, 10 classrooms within each school, and 25 students per classroom, what value
of the intra-class correlation results in power = 0.80? What if the researchers include a
cluster-level covariate that explains 49% of the variation in the cluster-level mean? What
value of the intra-class correlation results in power = 0.80?
In Scenario 4, the value of the intra-class correlation is unknown thus we select
power vs. intra-class correlation ( ρ ). Again, the only change in the toolbar is that ρ no
longer appears since it is allowed to vary along the x-axis. Now let’s explore Scenario 4
using the software.
To answer the question, click power vs. intra-class correlation (ρ). Then move
along the toolbar and specify n=25, K=15, J=10, δ = 0.20, σ²_δ = 0.01, B = 0.40, and
R²_L2 = 0.00 and 0.49. Figure 5 displays the new screen.
Figure 5. MSCRT - Power vs. intra-class correlation.
Clicking along the solid trajectory reveals that ρ of approximately 0.15 results in power
= 0.80. Clicking along the dotted trajectory reveals that ρ of approximately 0.25
achieves power = 0.80. Note that for any single trajectory, as the intra-class correlation
increases, the power of the test decreases.
7.2.6 Scenario 5
Based on past studies, the researchers estimate ρ = 0.15 and σ²_δ = 0.01. Assuming
15 schools are willing to participate, 10 classrooms within each school, and 25 students
per classroom, what is the minimum detectable effect size that results in power = 0.80?
What if the researchers include a cluster-level covariate that explains 49% of the
variation in the cluster-level mean? What is the minimum detectable effect size that
results in power = 0.80?
In Scenario 5, the effect size is unknown thus we select the power vs. effect size
option. Again, the only change in the toolbar is that δ no longer appears since it is
allowed to vary along the x-axis. Now let’s explore Scenario 5 using the software.
To answer the question, click on power vs. effect size. Then move along the
toolbar and specify n=25, K=15, J=10, σ²_δ = 0.01, ρ = 0.15, B = 0.40, and
R²_L2 = 0.00 and 0.49. Figure 6 displays the new screen.
Figure 6. MSCRT - Power vs. effect size.
Clicking along the solid trajectory reveals that an effect size of approximately 0.19 results
in power = 0.80. Clicking along the dotted trajectory reveals that by including a cluster-
level covariate that explains 49% of the variation in the cluster-level outcome, we can
detect an effect size of approximately 0.16.
7.2.7 Scenario 6
Based on past studies, the researchers estimate ρ = 0.15 and want to be able to
detect a minimum standardized effect size of 0.20. Assuming 15 schools are willing to
participate, 10 classrooms within each school, and 25 students per classroom, what is the
minimum effect size variability that results in power = 0.80? What if the researchers
include a cluster-level covariate that explains 49% of the variation in the cluster-level
mean? What is the minimum effect size variability that results in power = 0.80?
In Scenario 6, the effect size variability is unknown thus we select the power vs.
effect size variability (σ²_δ) option. Again, the only change in the toolbar is that σ²_δ no
longer appears since it is allowed to vary along the x-axis. Now let's explore Scenario 6
using the software.
To answer the question, click power vs. effect size variability (σ²_δ). Then move
along the toolbar and specify n=25, K=15, J=10, δ = 0.20, ρ = 0.15, B = 0.40, and
R²_L2 = 0.00 and 0.49. Figure 7 displays the new screen.
Figure 7. MSCRT - Power vs. effect size variability.
Clicking on the solid trajectory, we can achieve power = 0.80 with an effect size
variability of 0.016. However, if we include the cluster level covariate, we can achieve
power = 0.80 with an effect size variability of 0.038.
7.2.8 Scenario 7
Based on past studies, the researchers estimate ρ = 0.15 and σ²_δ = 0.01, and want
to be able to detect a minimum standardized effect size of 0.20. Assume 15 schools are
willing to participate, 10 classrooms within each school, and 25 students per classroom.
What proportion of explained variation by the cluster-level covariate results in power =
0.80?
In Scenario 7, a cluster-level covariate is available, but the proportion of
explained variation by the covariate is unknown. Thus, we select the power vs. proportion
of explained variation by level 2 covariate option. Again, the only change in the toolbar is
that R²_L2 no longer appears since it is allowed to vary along the x-axis. Now let's explore
Scenario 7 using the software.
To answer the question, click power vs. cluster level covariate correlation. Then
move along the toolbar and specify n = 25, J = 10, K = 15, δ = 0.20, σ²_δ = 0.01, ρ = 0.15,
and B = 0.40. Figure 8 displays the new screen.

Figure 8. MSCRT - Power vs. cluster level covariate correlation.

We can see that even without the cluster level covariate, the design has power = 0.80.
Inclusion of the cluster level covariate increases the power to greater than 0.80.
7.3 Power for effect size variability
Thus far we have focused on power calculations for the treatment effect in a
random effects model. Researchers may also be interested in the power for effect size
variability. Recall that if the effect size variability is large, the treatment effect may be
meaningless and it is important to investigate moderating effects to explain the variability
in effect sizes. As a result, it is important to be able to detect the effect size variability
with adequate power. Similar to the power for detecting the treatment effect, the power to
detect effect size variability is a function of the cluster size, n, the number of clusters, J,
the number of sites, K, and the intra-class correlation, ρ. The main difference is that the
effect size, δ, does not impact the power to detect treatment variability. Section 7.3.1
provides an example that will be used to illustrate how to use the software to calculate the
power to detect effect size variability.
7.3.1 Example
The example is a continuation of the example in Section 7.2.1. Recall that a team
of researchers has developed a new literacy program. The founders of the new
program propose that students who participate in the program will have increased reading
achievement. They propose a three level design with students nested within classrooms
nested within schools. They expect that blocking by school will explain 40% of the variability
in the outcome. They plan to test students who are in classrooms that participate in the
regular program (control group) and students who are in classrooms that participate in the
new program (experimental group) in each of the participating schools using a reading
test. In addition to determining the power for the treatment effect, they also want to know
the power to detect the variability in the effect sizes across sites. Five scenarios the
researchers might encounter are presented below. Assume α = 0.05 for each case.
7.3.2 Scenario 1
Based on past studies, the researchers estimate ρ = 0.15. Assuming 15 schools are
willing to participate as well as 10 classrooms per school, how many students within each
classroom are necessary to detect an effect size variability of 0.10 with power = 0.80?
In Scenario 1, the cluster size is unknown thus we select the power vs. cluster size
(n) option. The toolbar at the top of the screen is the same as the toolbar described in
Section 6.2.2, except there is no δ (effect size) button. Thus, detailed descriptions of the
buttons are not provided in this section.
To answer the question, click on Power vs. cluster size (n). Then move along the
toolbar and specify K = 15, J = 10, σ²_δ = 0.10, ρ = 0.15, and B = 0.40. Figure 9 displays the
new screen.

Figure 9. Power vs. cluster size.
Clicking along the trajectory reveals that 14 students per classroom are required to
achieve power = 0.80.
7.3.3 Scenario 2
Based on past studies, the researchers estimate ρ = 0.15. Assuming 10 classrooms
within each school are willing to participate as well as 25 students per classroom, how
many schools are necessary to detect an effect size variability of 0.10 with power = 0.80?
In Scenario 2, the number of sites is unknown thus we select the power vs.
number of sites (K) option. This allows the number of sites to vary along the x-axis.
To answer the question, click on Power vs. number of sites (K). Then move along
the toolbar and specify n = 25, J = 10, σ²_δ = 0.10, ρ = 0.15, and B = 0.40. Figure 10 displays
the result.
Figure 10. Power vs. number of sites.
Clicking along the trajectory reveals that 12 schools are necessary to detect an effect size
variability of 0.10 with power = 0.80. Note that as the number of sites increases, the
power increases.
7.3.4 Scenario 3
Based on past studies, the researchers estimate ρ = 0.15. Assuming 15 schools are
willing to participate as well as 25 students within each classroom, how many classrooms
are necessary to detect an effect size variability of 0.10 with power = 0.80?
In Scenario 3, the number of clusters is unknown, thus we select the power vs.
number of clusters (J) option. This allows the number of clusters to vary along the x-axis.
To answer the question, click on Power vs. number of clusters (J). Then move
along the toolbar and specify n = 25, K = 15, σ²_δ = 0.10, ρ = 0.15, and B = 0.40. Figure 11
displays the results.
Figure 11. Power vs. number of clusters per site.
Clicking along the trajectory, we can see that 8 classrooms per school are necessary to
achieve power = 0.80 to detect an effect size variability of 0.10. Note that the power to
detect the effect size variability increases rapidly towards 1 as the number of clusters per site
increases.
7.3.5 Scenario 4
Assume the researchers have secured 15 schools as well as 10 classrooms per
school and 25 students per classroom. What value of the intra-class correlation is
necessary to detect an effect size variability of 0.10 with power = 0.80?
In Scenario 4, the value of the intra-class correlation is unknown, thus we select
the power vs. intra-class correlation (ρ) option. This allows ρ to vary along the x-axis.
To answer the question, click on Power vs. intra-class correlation (ρ). Then
move along the toolbar and specify n = 25, J = 10, K = 15, and σ²_δ = 0.10. Figure 13
displays the results.

Figure 13. Power vs. intra-class correlation.
Clicking along the trajectory reveals that ρ = 0.11 is necessary to detect an effect size
variability of 0.10 with power = 0.80. Note that as the intra-class correlation increases, the
power to detect the treatment effect variability decreases.
7.3.6 Scenario 5
Based on past studies, the researchers estimate ρ = 0.15. Assuming 15 schools are
willing to participate with 10 classrooms per school and 25 students per classroom, what
is the minimum effect size variability that results in power = 0.80?
In Scenario 5, the effect size variability is unknown, thus we select the power vs.
effect size variability (σ²_δ) option. This allows the effect size variability to vary along
the x-axis.
To answer the question, click on Power vs. effect size variability (σ²_δ). Then
move along the toolbar and specify n = 25, J = 10, K = 15, ρ = 0.15, and B = 0.40.
Figure 14 displays the results.

Figure 14. Power vs. effect size variability.
Clicking along the trajectory reveals that a minimum effect size variability of 0.087 can
be detected with power = 0.80. Note that as the effect size variability increases, the power
tends towards 1. Intuitively this makes sense: the more variation there is among the effect
sizes, the easier it is to detect. Note that this is the opposite of what occurs in the test for
the treatment effect. In that test, as the effect size variability increases, the power to detect
the treatment effect decreases. This makes sense because it becomes more difficult to
detect the effect if there is a lot of variation.
In general, smaller sample sizes are required to achieve adequate power to detect
the variance in the treatment effect than to detect the main effect of treatment. Thus the
primary focus of the researcher should be to design a study that has good power to detect
the main effect of treatment. The researcher can then investigate the power to detect the
treatment effect variability, since this test has less stringent sample size requirements.
7.4 Power for the Average Treatment Effect (Fixed Effects Model)
We can also calculate the power for the treatment effect in a fixed effects model.
In this case, the power for the treatment effect is a function of the cluster size, n, the
number of clusters, J, the number of sites, K, the intra-class correlation, ρ, the effect
size, δ, the percentage of variance explained by the blocking variable, B, and the cluster-
level covariate correlation, denoted R²_L2 in the program. In the fixed effects case, the
effect size variability, σ²_δ, is set equal to 0. Using the OD program to calculate the power
for the average treatment effect in a fixed effects model is the same as in the random
effects model discussed in Section 6.2 except that we specify σ²_δ = 0. Only one example
for the fixed effects model is provided below since the directions in Section 6.2 can easily
be modified for a fixed effects model by specifying σ²_δ = 0.
Suppose researchers want to test the effect of a new math program designed for
students in grades 1-12. Based on past studies, the researchers know that the school type
(elementary, middle, or high school) is related to the outcome, math achievement.
Research indicates that blocking on school type reduces the between-school variation by
70%. Research also indicates that the between-cluster (school) variation prior to blocking
is 0.15. Within each school, the researchers test 100 students. What is the minimum effect
size the researchers can detect with 10 schools per site?
To answer the question, click on Power vs. effect size (delta). Then move
along the toolbar and specify n = 100, J = 10, K = 3, σ²_δ = 0.00, ρ = 0.15, and B = 0.70.
Figure 15 displays the result.
Figure 15. MSCRT - Power vs. effect size.
A note appears on the screen indicating that this is a fixed effects model because the
effect size variability is set to 0. Clicking along the trajectory reveals that an effect size of
approximately 0.25 can be detected for power = 0.80.
8. Three Level Models with Randomization at Level Three
In Chapter 7, we discussed multi-site cluster randomized trials, or designs that
include three levels, where level three is a site, or block. For example, in a multi-site
cluster randomized trial, we might have students nested within classrooms within schools
where schools function as blocks. In this case, the randomization occurs at level 2, or the
classroom-level. In this chapter, we again consider three levels of data. However, in the
three level trial discussed in this chapter, the randomization occurs at level 3, or at the
school level. Chapter 8 provides a conceptual background for the three level model with
treatment at level 3, including the use of covariates at level 3. Chapter 9 describes how to
use the Optimal Design software to design a three level study with adequate power to
detect the treatment effect.
8.1 General Description of the Three Level Model
A three level trial with randomization at level 3 is a commonly used design. For
example, imagine an evaluation for a new elementary math program. Schools are
randomly assigned to either the new program or their regular program. Within each
school, all the classrooms adopt the new program. Thus, we have students within
classrooms within schools, where schools are assigned at random to treatment or control.
In order to calculate the power for this design, we need to account for the variability at
the child, classroom, and school levels. A common mistake is to simplify this trial to a
cluster randomized trial and ignore the classroom level. However, because of variability
among teachers, students in classroom A might react differently to the program than
students in classroom B. In addition, students within a classroom might be more similar
to each other than to students in other classrooms. We need to account for the classroom,
or teacher, level variability so that we do not overestimate the precision of the estimate
and the power of the test.
This chapter focuses on how to calculate the power for a three level design under
two different conditions. First, we discuss power considerations when there is no level 3
covariate. Second, we look at the power for a design with a level 3 covariate.
8.2 The Three Level Model With No Covariates
Suppose a team of researchers is interested in the effect of a new comprehensive
school reform (CSR) on math outcomes. The CSR is implemented at the school level.
Schools are randomly assigned to either the CSR or their regular teaching methods.
Within each school, students are nested within classrooms. To account for the nested
structure of the data, the researchers use a three level model. Suppose there are 40
schools participating in the experiment, 20 treatment schools and 20 control schools.
Within each school there are 8 classes and 25 students within each class. The researchers
want to detect an effect size of 0.25. Assume that 20% of the variation lies between
classrooms and 10% of the variation lies between schools. What is the power of the test
to detect the treatment effect based on the above constraints?
Entering the information into the OD software reveals that the power to detect the
treatment effect is 0.58. The trajectory for this case is displayed in Figure 1.
Figure 1. Three level model – Power vs. number of sites (α = 0.050, n = 25, J = 8,
τ_π = 0.200, τ_β = 0.100, δ = 0.25).
Note that as the number of schools increases, the power to detect an effect also increases.
Let’s look at the model to help us understand the components of power in a three level
design.
8.2.1 The Model
We can represent the data from this design as persons nested within clusters
nested within sites. The level 1, or person-level model is:
Y_ijk = π_0jk + e_ijk,   e_ijk ~ N(0, σ²)   (1)

where i = 1, …, n indexes persons per cluster
j = 1, …, J indexes clusters per site
k = 1, …, K indexes sites
π_0jk is the mean for cluster j in site k
e_ijk is the error associated with each person
σ² is the within-cluster variance.
The level-2 model, or cluster-level model, is:
π_0jk = β_00k + r_0jk,   r_0jk ~ N(0, τ_π)   (2)

where β_00k is the mean for site k
r_0jk is the random effect associated with each cluster
τ_π is the variance between clusters within sites.
The level-3 model, or site-level model, is:
β_00k = γ_000 + γ_001 W_k + u_00k,   u_00k ~ N(0, τ_β00)   (3)

where γ_000 is the estimated grand mean
γ_001 is the treatment effect ("main effect of treatment")
W_k is 0.5 for treatment and –0.5 for control
u_00k is the random effect associated with each site mean
τ_β00 is the residual variance between site means.
Note that unlike the multi-site cluster randomized trial, the randomization in this design
occurs at level 3.
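To make equations (1) through (3) concrete, here is a minimal data-generating sketch in Python (NumPy). The function name, argument order, and treatment-assignment scheme are illustrative assumptions, not part of the OD program.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def simulate_three_level(n, J, K, gamma_000, gamma_001,
                         tau_beta, tau_pi, sigma2):
    """Generate one balanced data set from equations (1)-(3)."""
    rows = []
    for k in range(K):
        W = 0.5 if k < K // 2 else -0.5             # site-level treatment code
        u = rng.normal(0, np.sqrt(tau_beta))        # site random effect
        beta_00k = gamma_000 + gamma_001 * W + u    # equation (3)
        for j in range(J):
            r = rng.normal(0, np.sqrt(tau_pi))      # cluster random effect
            pi_0jk = beta_00k + r                   # equation (2)
            e = rng.normal(0, np.sqrt(sigma2), n)   # person-level errors
            rows.extend(pi_0jk + e)                 # equation (1)
    return np.array(rows)
```

With the standardized variance components used later in this chapter (τ_β00 = 0.10, τ_π = 0.20, σ² = 0.70), the total outcome variance of a simulated data set should be close to 1.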
8.2.2 Testing the Main Effect of Treatment
In the model above, the treatment effect is estimated at level 3 and is denoted
γ_001. Given a balanced design, it is estimated by:

γ̂_001 = Ȳ_E − Ȳ_C   (4)

where Ȳ_E is the mean for the experimental group
Ȳ_C is the mean for the control group.
Because of the nested structure of the data, we sum over clusters and sites in order to
estimate the treatment effect. The variance of the estimated treatment effect combines the
variance at all three levels: the variance between site means, τ_β00, the within-site or
between-cluster variance, τ_π, and the within-cluster or between-person variance, σ².
Note that unlike a multi-site cluster randomized trial, there is no estimated variance
component for the between-site variance in the treatment effect, τ_β11. This difference is
easy to see by comparing the models for the two designs. There is no τ_β11 in the model for
the three level design. Conceptually, the difference exists because in the multi-site cluster
randomized trial, we have mini-experiments at each site which allow us to estimate K
treatment effects and to calculate the between-site variability of the treatment effect.
However, in the three level design, the treatment is applied at level 3 so we are only able
to estimate one treatment effect. The variance of the treatment effect is estimated by:

Var(γ̂_001) = 4[τ_β00 + (τ_π + σ²/n)/J]/K   (5)
If the data are balanced, we can use the results of a nested analysis of variance
with random effects for the clusters and sites and a fixed effect for the treatment. Similar
to prior tests, the test statistic is an F statistic. The F test follows a non-central F
distribution, F(1, K-2; λ). Recall that the noncentrality parameter, λ, is a ratio of the
squared treatment effect to the variance of the treatment effect estimate. Below is the
noncentrality parameter for the test:

λ = γ²_001 / Var(γ̂_001) = γ²_001 / {4[τ_β00 + (τ_π + σ²/n)/J]/K}   (6)
Recall that increasing the noncentrality parameter increases the power to detect
the treatment effect. Let’s examine how the researcher can increase the noncentrality
parameter to increase the power of the test. Because this model assumes no covariates,
we cannot reduce any of the variance components, so τ_β00, τ_π, and σ² are not under the
control of the researcher. The only remaining pieces of the noncentrality parameter are
the sample size and the size of the treatment effect. The size of the treatment effect is
often based on theory, past studies, or a pilot study, which means the researcher cannot
inflate the size of the treatment effect to increase power without weakening the
theoretical or practical conclusions of the study. Thus increasing the sample size is the
only option for increasing the power. From equation 6, we can see that increasing the
number of sites, K, is the most effective strategy to increase the power, followed by
increasing the number of clusters, J, and finally the number of persons per cluster, n.
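As a concrete illustration of equations (5) and (6), the following Python sketch uses SciPy's noncentral F distribution to compute the power for the main effect of treatment. The function name and argument order are illustrative assumptions, not the OD program's internals.

```python
from scipy.stats import f, ncf

def power_main_effect(gamma_001, tau_beta, tau_pi, sigma2,
                      n, J, K, alpha=0.05):
    """Power to detect the main effect of treatment in a three level
    design with randomization at level 3, assuming a balanced design."""
    # Equation (5): variance of the estimated treatment effect
    var_gamma = 4 * (tau_beta + (tau_pi + sigma2 / n) / J) / K
    # Equation (6): noncentrality parameter
    lam = gamma_001 ** 2 / var_gamma
    # Critical value of the central F(1, K-2) test at level alpha
    f_crit = f.ppf(1 - alpha, 1, K - 2)
    # Power = P(F' > f_crit), where F' ~ noncentral F(1, K-2; lambda)
    return ncf.sf(f_crit, 1, K - 2, lam)
```

With the Section 8.2 example expressed in standardized form (γ_001 = 0.25, τ_β00 = 0.10, τ_π = 0.20, σ² = 0.70, n = 25, J = 8, K = 40), this sketch should give power close to the 0.58 reported there.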
8.2.3 The Standardized Model with No Covariates
Thus far we have focused on the unstandardized model. However, as previously
mentioned, researchers typically discuss standardized effect sizes. We continue to utilize
Cohen’s definition for standardized effect sizes, and adopt 0.20, 0.50, and 0.80 as small,
medium, and large effect sizes.
In the standardized model, we set the sum of the within-cluster variance, σ², the
between-cluster variance, τ_π, and the between-site variance, τ_β00, equal to 1. Since we
use three components of variance to standardize the model, we have two intra-class
correlations, ρ_level2 and ρ_level3. The first intra-class correlation, ρ_level2,
corresponds to the between-cluster variance relative to the total variance:

ρ_level2 = τ_π / (τ_π + τ_β00 + σ²).

The second intra-class correlation, ρ_level3, is the between-site variance relative to the
total variance:

ρ_level3 = τ_β00 / (τ_π + τ_β00 + σ²).
In standardized notation, the non-centrality parameter, λ, can be rewritten as:

λ = δ² / {4[ρ_level3 + (ρ_level2 + (1 − ρ_level2 − ρ_level3)/n)/J]/K}   (7)

where δ is the standardized main effect of treatment, δ = γ_001 / √(τ_β00 + τ_π + σ²).
Because ρ_level2 and ρ_level3 are often unknown quantities and can be difficult to estimate,
the Optimal Design program asks the user for estimates of the proportion of variance at
level 1, level 2, and level 3, where the sum of the three variances is constrained to equal 1.
The user should be able to more easily estimate these values.
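Equation (7) can be evaluated the same way directly from the standardized inputs. This is a sketch assuming the proportions of variance at the three levels sum to 1; the function name is illustrative.

```python
from scipy.stats import f, ncf

def power_standardized(delta, rho_l2, rho_l3, n, J, K, alpha=0.05):
    """Power for the main effect from standardized inputs: delta is the
    standardized effect size; rho_l2 and rho_l3 are the proportions of
    variance at levels 2 and 3 (level 1 receives the remainder)."""
    # Equation (7): noncentrality in terms of the intra-class correlations
    lam = delta ** 2 / (
        4 * (rho_l3 + (rho_l2 + (1 - rho_l2 - rho_l3) / n) / J) / K
    )
    f_crit = f.ppf(1 - alpha, 1, K - 2)
    return ncf.sf(f_crit, 1, K - 2, lam)
```

For the Section 8.2 example (δ = 0.25, 20% of the variance between classrooms, 10% between schools, n = 25, J = 8, K = 40), this should reproduce a power near the reported 0.58.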
8.3 The Three Level Model with a Site Level Covariate
Often a site-level covariate may be available to the researcher. The researchers
can use this information to reduce the level-3 variability, or the between-site variance. As
noted in Section 8.2.2, reducing the site-level variability can help increase the power of
the test. Because a site-level covariate is measured at level 3, it only affects the variability
at level 3, τ_β00. In other words, including a site-level covariate will not affect the
between-cluster variability, τ_π, or the within-cluster variability, σ². We use S to denote a
site-level covariate in the model. The proportion of variance explained by the site-level
covariate is defined as ρ²_β00,S.
Let's modify the example provided in Section 8.2 to include a site-level covariate.
Recall that a team of researchers is investigating the effects of a CSR on math outcomes.
Suppose that school means on last year's state math test are available to the researchers.
Suppose this site-level covariate reduces the level 3 variance by 64%. Recall that 20% of
the variation lies between classrooms, 10% of the variation in the outcome lies between
schools, and 70% of the variation lies within classrooms. The site-level covariate reduces
the between school variance to (1-.64)*(0.10)=0.036, meaning that only 3.6% of the
variation between sites is unexplained. Recall that there are 40 schools participating in
the experiment, 20 treatment schools and 20 control schools. Within each school there are
8 classes and 25 students within each class. Under the original conditions, with no site-
level covariate, the power to detect the main effect of treatment was 0.58. Using the OD
software, we can see that including the site-level covariate results in
power equal to 0.86. Figure 2 displays the results.
Figure 2. Three level model – Power vs. number of sites (α = 0.050, n = 25, J = 8,
τ_π = 0.200, τ_β = 0.100; solid: δ = 0.25; dotted: δ = 0.25, R²_L3 = 0.64).
Let's look at the model for the design with a site-level covariate to see how the
site-level covariate affects the power of the test.
8.3.1 The Model
Levels 1 and 2 of the model with a site-level covariate are identical to the level 1
and level 2 equations (equations 1 and 2) for the case with no covariate. This is because
inclusion of a site-level covariate does not affect the variability at the lower levels of the
model. The new level 3, or site-level model, is:

β_00k = γ_000 + γ_001 W_k + γ_002 S_k + u_00k,   u_00k ~ N(0, τ_β00|S)   (8)

Note: τ_β00|S = (1 − ρ²_β00,S) τ_β00

where γ_000 is the estimated grand mean
γ_001 is the treatment effect ("main effect of treatment")
γ_002 is the regression coefficient for the level 3 covariate
W_k is 0.5 for treatment and –0.5 for control
S_k is the level 3 covariate
u_00k is the random effect associated with each site mean
τ_β00|S is the residual variance between site means conditional on the site-level
covariate.
Note that the level 3 variance is adjusted for the covariate. The smaller variance
increases the precision of the estimate, thus increasing the power of the test.
Given a balanced design, the main effect of treatment is estimated as the
difference in the treatment and control groups adjusted for the site-level covariate:
γ̂_001 = (Ȳ_E − Ȳ_C) − γ̂_002 (S̄_E − S̄_C).   (9)
The variance of the treatment effect is:

Var(γ̂_001 | S) = 4[τ_β00|S + (τ_π + σ²/n)/J]/K.   (10)

Note that only the between-site variance, τ_β00, is adjusted for inclusion of the covariate
since the covariate is at the site level.
Similar to the case with no covariate, to test the main effect of treatment we use
an F statistic which follows a non-central F distribution, F(1, K-3; λ_S), where:

λ_S = γ²_001 / {4[τ_β00|S + (τ_π + σ²/n)/J]/K}.   (11)
The noncentrality parameter for the test for the main effect of treatment looks similar to
equation 6, the case with no covariate, except that the level 3 variance and the estimate of
the treatment effect are adjusted for the site-level covariate. Note that reducing the
variability at level 3 gives the researcher another tool for increasing the noncentrality
parameter and increasing the power. In cases when the between-site variance accounts for
a high proportion of the variance, finding a site-level covariate that is highly correlated
with the site-level outcome can be very beneficial. It may also help reduce the number of
sites necessary to achieve a specified power, which can reduce the cost of the study.
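The effect of the covariate in equations (10) and (11) can be sketched the same way: shrink the level-3 variance by the proportion explained and drop one denominator degree of freedom. A minimal Python sketch with illustrative names, using standardized inputs:

```python
from scipy.stats import f, ncf

def power_with_site_covariate(delta, rho_l2, rho_l3, r2_l3,
                              n, J, K, alpha=0.05):
    """Power for the main effect with a level-3 covariate that explains
    r2_l3 of the between-site variance (standardized inputs)."""
    tau_beta_adj = (1 - r2_l3) * rho_l3      # residual level-3 variance
    sigma2 = 1 - rho_l2 - rho_l3             # level-1 proportion of variance
    # Equation (11): noncentrality with the adjusted level-3 variance
    lam = delta ** 2 / (4 * (tau_beta_adj + (rho_l2 + sigma2 / n) / J) / K)
    # One fewer denominator degree of freedom than the no-covariate test
    f_crit = f.ppf(1 - alpha, 1, K - 3)
    return ncf.sf(f_crit, 1, K - 3, lam)
```

With the Section 8.3 example values (δ = 0.25, variance proportions of 0.20 and 0.10 at levels 2 and 3, R² = 0.64, n = 25, J = 8, K = 40), this should give a power near the 0.86 reported above.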
8.3.2 The Standardized Model with a Site-Level Covariate
Following the same logic as the three level model with no covariates, it is
important to standardize the model. The noncentrality parameter expressed in
standardized notation is:

λ_S = δ*² / {4[ρ*_level3 + (ρ*_level2 + (1 − ρ*_level2 − ρ*_level3)/n)/J]/K}

where

ρ*_level2 = τ_π / (τ_π + τ_β00|S + σ²) is the intra-class correlation at level 2, or the
proportion of variance among clusters relative to the total variation conditional on the
level-3 covariate;

ρ*_level3 = τ_β00|S / (τ_π + τ_β00|S + σ²) is the intra-class correlation at level 3, or the
proportion of variance among sites relative to the total variation conditional on the
level-3 covariate;

δ* = γ_001 / √(τ_π + τ_β00|S + σ²) is the standardized main effect of treatment
conditional on the level-3 covariate.

Because the conditional standardized quantities, ρ*_level2, ρ*_level3, and δ*, are
frequently unknown, the program asks the user to enter the unconditional parameters. The
program calculates the conditional standardized values based on the value the user
specifies for the percent of variance reduction at level 3, R²_level3.
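The conversion the program performs can be sketched directly from these definitions, assuming the unconditional variance components sum to 1 (the function name is illustrative, not the OD program's code):

```python
def conditional_standardized(rho_l2, rho_l3, delta, r2_l3):
    """Convert unconditional standardized quantities to the conditional
    ones (rho*_level2, rho*_level3, delta*) used with a level-3 covariate."""
    # Conditional total variance after removing the explained level-3 share
    total = 1 - r2_l3 * rho_l3
    rho_l2_star = rho_l2 / total
    rho_l3_star = (1 - r2_l3) * rho_l3 / total
    delta_star = delta / total ** 0.5
    return rho_l2_star, rho_l3_star, delta_star
```

For example, with unconditional proportions of 0.20 and 0.10, δ = 0.25, and R² = 0.64, the conditional level-3 proportion shrinks to 0.036/0.936 while δ* grows slightly, since the total conditional variance is below 1.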
9. Using the Optimal Design Software for the Three Level Model with Treatment at
Level Three
This chapter focuses on how to use the OD software to design a three level trial
with treatment at level three. Section 9.1 provides general information for how to use the
three level model with treatment at level three. Section 9.2 provides an example and
details regarding the options within the three level design option.
9.1 General Information
The three level trial with treatment at level three option allows the researcher to
calculate the power for the average treatment effect as a function of the cluster size, the
number of clusters per site, the number of sites, the effect size, the level-2 variability, the
level-3 variability, and the proportion of variance explained by the site level covariate.
The menu is below:
Three Level Model with Treatment at Level 3
Power vs. cluster size (n)
Power vs. number of clusters per site (J)
Power vs. number of sites (K)
Power vs. effect size (delta)
Power vs. proportion of variance reduction at level 3 (R2)
9.2.1 Example
A team of researchers is designing a study to determine if a particular whole
school reform model improves academic achievement. The design consists of students
nested within classrooms nested within schools, which naturally lends itself to a three
level model. The reform effort is implemented at the level of the school, so the most
appropriate design is a three level design with treatment at level three. The researchers
plan to test students who are in the schools that participate in the new school reform
model (experimental group) and students who are in the schools that participate in the
regular model (control group) to determine if students in the new school reform model
score higher on an academic assessment. The researchers are unsure how to proceed with
respect to the number of students they should test in each classroom, the number of
classrooms in each school, and the number of schools in order to conduct a trial with
power = 0.80. Five scenarios the researchers might encounter are presented below.
Assume α = 0.05 for each case.
9.2.2 Scenario 1
Based on past studies, the researchers estimate that 10% of the variability is at the
classroom level (level 2) and 10% of the variability is at the school level (level 3) leaving
80% of the variability between students. A minimum standardized effect size of 0.25 is
desired. Assuming 40 schools are willing to participate as well as 8 classrooms within
each school, how many students within each classroom are necessary to achieve power =
0.80? What if the researchers include a school-level covariate that explains 49% of the
variation in the school level mean? How many students per classroom are required to
achieve power = 0.80?
In Scenario 1, the cluster size is unknown thus we select the power vs. cluster size
(n) option. Figure 1 displays the screen.
Figure 1. 3 Level Model Screen.
The buttons in the toolbar are explained below.
α - specifies the significance level, or chance of a Type I error. By default, α is set at
0.05, which is a common level for most designs.
K – specifies the number of sites. By default, K is set at 30.
J – specifies the number of clusters within each site. By default, J is set at 6.
δ - specifies the minimum effect size of interest. By default, the minimum effect size is
set at 0.20.
Set – specifies the proportion of variability at level-2 and level-3. The total variability for
levels 1, 2, and 3 is 1. After entering the proportion of variability at levels 2 and 3, click
on compute sigma. The default settings are τ_π (level-2 variability) = 0.10 and τ_β (level-3
variability) = 0.10, so σ² (level-1 variability) = 0.80.
R²_L3 - specifies the proportion of explained variance in the level 3 mean outcome by the
level 3 covariate. By default, R²_L3 is set to 0.
The remaining options in the toolbar are the same as those in the other modules. The
details can be found in Chapter 3.
Follow the steps below to answer the questions:
Step 1: Click on power vs. cluster size (n).
Step 2: Click on J on the toolbar and set J = 8. A graph will appear but by looking at the
key we can see it does not match the specific settings for this example.
Step 3: Click on K and set K = 40.
Step 4: Click on δ and set δ = 0.25.
Step 5: Click on R²_L3. Leave R²_Level3 (1) equal to 0. Set R²_Level3 (2) equal to 0.49.
Figure 2 displays the screen.

Figure 2. Three Level Model – Power vs. cluster size.
Note that two trajectories appear on the screen. Without the covariate, we cannot achieve
power = 0.80. However, clicking along the dotted trajectory, it is clear that including the
covariate allows us to achieve power = 0.80 with only 9 students per classroom. The
level-3 covariate helps increase the power and is usually rather inexpensive to collect
since it is a site-level characteristic, which in the case of schools can usually be found in
a central database.
9.2.3 Scenario 2
Based on past studies, the researchers estimate that 10% of the variability is at the
classroom level (level 2) and 10% of the variability is at the school level (level 3) leaving
80% of the variability between students. A minimum standardized effect size of 0.25 is
desired. Assuming 40 schools are willing to participate and there is an average of 30
students per classroom, how many classrooms are necessary to achieve power = 0.80?
What if the researchers include a school-level covariate that explains 49% of the variation
in the school level mean? How many classrooms are required to achieve power = 0.80?
In Scenario 2, the number of clusters is unknown thus we select the power vs.
number of clusters per site (J) option.
Follow the steps below to answer the questions:
Step 1: Click on power vs. number of clusters per site (J).
Step 2: Click on n on the toolbar and set n = 30. A graph will appear but by looking at the
key we can see it does not match the specific settings for this example.
Step 3: Click on K and set K = 40.
Step 4: Click on δ and set δ = 0.25.
Step 5: Click on R²_L3. Leave R²_Level3 (1) equal to 0. Set R²_Level3 (2) equal to 0.49.
Figure 3 displays the screen.

Figure 3. Three Level Model – Power vs. number of clusters per site.
Note that like the cluster size, increasing the number of clusters per site does not result in
power = 1.0. It is clear that without a covariate, we cannot achieve power = 0.80.
Including the covariate results in power = 0.80 for 5 clusters per site. Let’s see what
happens when we allow the number of sites to vary.
9.2.4 Scenario 3
Based on past studies, the researchers estimate that 10% of the variability is at the
classroom level (level 2) and 10% of the variability is at the school level (level 3) leaving
80% of the variability between students. A minimum standardized effect size of 0.25 is
desired. Assuming there is an average of 8 classrooms within each school that are willing
to participate and 30 students per classroom, how many schools are necessary to achieve
power = 0.80? What if the researchers include a school-level covariate that explains 49%
of the variation in the school level mean? How many schools are required to achieve
power = 0.80?
In Scenario 3, the number of sites is unknown thus we select the power vs.
number of sites (K) option.
Follow the steps below to answer the questions:
Step 1: Click on power vs. number of sites (K).
Step 2: Click on n on the toolbar and set n = 30. A graph will appear but by looking at the
key we can see it does not match the specific settings for this example.
Step 3: Click on J and set J = 8.
Step 4: Click on δ and set δ = 0.25.
Step 5: Click on R². Leave R²(1) equal to 0 and set R²(2) equal to 0.49. Figure 4
displays the screen.
Figure 4. Three Level Model – Power vs. number of sites.
Note that as the number of sites increases, the power goes to 1. Clicking along the
trajectory, we can see that 60 sites achieves power = 0.80 in the case of no covariate.
Including the covariate reduces the number of necessary sites to 36, which is much more
reasonable.
9.2.5 Scenario 4
Based on past studies, the researchers estimate that 10% of the variability is at the
classroom level (level 2) and 10% of the variability is at the school level (level 3) leaving
80% of the variability between students. A minimum standardized effect size of 0.25 is
desired. Assume 40 schools are willing to participate, 8 classrooms within each school,
and 30 students within each class. What is the minimum detectable standardized effect
for power = 0.80? What if the researchers include a school-level covariate that explains
49% of the variation in the school-level mean? What is the minimum detectable effect that
achieves power = 0.80?
In Scenario 4, the minimum detectable effect is unknown thus we select the power
vs. effect size option.
Follow the steps below to answer the questions:
Step 1: Click on power vs. effect size (delta).
Step 2: Click on n on the toolbar and set n = 30. A graph will appear but by looking at the
key we can see it does not match the specific settings for this example.
Step 3: Click on J and set J = 8.
Step 4: Click on K and set K = 40.
Step 5: Click on R². Leave R²(1) equal to 0 and set R²(2) equal to 0.49. Figure 5
displays the screen.
Figure 5. Three Level Model – Power vs. effect size.
Clicking along the trajectory reveals a minimum detectable effect of 0.30 in the case of
no covariate and 0.24 in the case of the covariate for power = 0.80.
9.2.6 Scenario 5
Based on past studies, the researchers estimate that 10% of the variability is at the
classroom level (level 2) and 10% of the variability is at the school level (level 3) leaving
80% of the variability between students. A minimum standardized effect size of 0.25 is
desired. Assume 40 schools are willing to participate, 8 classrooms within each school,
and 30 students within each classroom. What proportion of the variability in the level 3
outcome does the level 3 covariate need to explain in order to achieve power = 0.80?
In Scenario 5, the proportion of variance reduction as a result of the level 3
covariate is unknown thus we select the power vs. proportion of variance reduction at
level 3 (R2) option.
Follow the steps below to answer the questions:
Step 1: Click on power vs. proportion of variance reduction at level 3 (R2).
Step 2: Click on n on the toolbar and set n = 30. A graph will appear but by looking at the
key we can see it does not match the specific settings for this example.
Step 3: Click on J and set J = 8.
Step 4: Click on K and set K = 40.
Step 5: Click on δ and set δ = 0.25. Figure 6 displays the screen.
Figure 6. Three Level Model – Power vs. proportion of variance reduction at level 3.
Clicking along the trajectory reveals power = 0.80 can be achieved if 43% of the
variation in the level 3 mean outcome is explained by the level 3 covariate. Note that as
the proportion of explained variation increases towards 1, the power also increases
towards 1.
10. Repeated Measures in Cluster Randomized Trials
Chapters 10 and 11 explore longitudinal research designs. In a longitudinal study,
or repeated measures design, people are followed over time and observed on several
occasions. Chapter 10 explores the conceptual framework surrounding repeated measures
in cluster randomized trials and Chapter 11 describes how to use the Optimal Design
software to design a cluster randomized trial with repeated measures.
10.1 Why repeated measures?
In a typical longitudinal study, observations are recorded prior to treatment, often
referred to as the baseline measurement, and then after the treatment a pre-determined
number of times. Measuring participants prior to treatment and post-treatment allows the
researchers to assess individual growth. Individual growth may be plotted via a straight
line or a curvilinear trajectory. A linear trajectory, or first degree polynomial, is
characterized by an intercept and a linear rate of change, or slope. Curvilinear trajectories
are second, third, or higher degree polynomials. A second degree polynomial, also known
as a quadratic polynomial, adds an acceleration parameter to the intercept and rate of
change. A third degree polynomial, also known as a cubic polynomial, is characterized by
four parameters: an intercept, a linear rate of change, an acceleration, and a change in
acceleration.
In a simple repeated measures design, individuals are repeatedly observed and
individual trajectories are plotted to assess average treatment effects on a specific
polynomial change parameter. In this chapter we extend the simple design to settings in
which individuals are nested within clusters and treatment is applied at the cluster level.
This allows us to assess the average difference in the polynomial change parameter for
those in the treatment group and those in the control group, accounting for the cluster
effect.
To illustrate, imagine that a group of researchers develop a new phonics program
for first graders. The program is an intense year-long program. Students are assessed at
the beginning of the year, prior to treatment, and five times throughout the year. 40
classrooms have been randomly selected to participate in the study, 20 in the treatment
group and 20 in the control group. Each classroom has 25 students and all 25 will
participate in the study. Since we have repeated measures on students who are nested
within classrooms, we must treat the design as a cluster randomized trial with repeated
measures in order to determine the power of the test correctly. If we ignore the clusters,
the estimate of the variance of the treatment effect and the power calculations will not be
correct.
The power to detect the main effect of treatment in a repeated measure cluster
randomized trial is more complicated than in a cluster randomized trial because we need
to take the repeated measures on each person into consideration. However, in this chapter
we try to keep things simple by focusing only on orthogonal designs with continuous
outcomes, a random-effects covariance structure, homogeneous covariance structures
within treatments, and complete data. In these designs, power is a function of the
frequency of observations, f, the duration of the study, D, the total number of
observations, M, the number of participants within each cluster, n, the number of clusters,
J, the effect size, δ, the intra-class correlation, ρ, the within-person variance, σ², the
between-person variance on the polynomial change parameter of order p, τ_πp, and the
associated between-cluster variance, τ_βp. The data lend themselves to the three-level
hierarchical model described in the next section.
10.2. The Model
Data from a cluster randomized trial with repeated measures on the individuals
can be represented with a three-level model, with occasions nested within persons and
persons nested within clusters. The general level-1 model, or repeated measures model,
represents the trajectory of change for person i as a polynomial function of degree P − 1
defined at equally spaced observations. The model is:

Y_mij = Σ_{p=0}^{P−1} π_pij c_pm + e_mij,   e_mij ~ N(0, σ²),   (1)

for occasions m ∈ {1, 2, ..., M}, persons i ∈ {1, 2, ..., n}, and clusters j ∈ {1, 2, ..., J},
where p is the polynomial order of change (e.g., linear, quadratic, or cubic);
π_pij is the level-1 coefficient for the polynomial of order p;
c_pm is the orthogonal polynomial contrast coefficient;
e_mij is the error associated with the repeated measures; and
σ² is the within-person variance.
Note the orthogonal polynomial contrast coefficients are necessary to center the data.
These coefficients are given by (see, e.g., Kirk 1982; Raudenbush and Liu 2001):
c_0m = 1,   (2)
c_1m = m − (Σ_{m=1}^{M} m)/M,
c_2m = c_1m² − (Σ_{m=1}^{M} c_1m²)/M, and
c_3m = c_1m³ − c_1m (Σ_{m=1}^{M} c_1m⁴)/(Σ_{m=1}^{M} c_1m²).
The level-2 model, or person-level model, is:

π_pij = β_p0j + r_pij,   r_pij ~ N(0, τ_πp),   (3)

where β_p0j is the cluster mean for the pth polynomial change parameter;
r_pij is the random effect associated with the persons; and
τ_πp is the between-person variance for the pth polynomial change parameter.
The level-3 model, or cluster-level model, is:

β_p0j = γ_p00 + γ_p01 W_j + u_p0j,   u_p0j ~ N(0, τ_βp),   (4)

where γ_p00 is the grand mean for the polynomial order of change;
γ_p01 is the main effect of treatment;
W_j is a treatment contrast indicator, ½ for treatment and −½ for control;
u_p0j is the random effect associated with each cluster; and
τ_βp is the between-cluster variance for the polynomial order of change.
To help clarify the general model, consider a first degree polynomial order of
change, or linear model (p = 1). The level-1 model is:

Y_mij = π_0ij + π_1ij c_1m + e_mij,   e_mij ~ N(0, σ²),   (5)

for occasions m ∈ {1, 2, ..., M}, persons i ∈ {1, 2, ..., n}, and clusters j ∈ {1, 2, ..., J},
where π_0ij is the mean response for person i in cluster j;
π_1ij is the average rate of change for person i in cluster j;
c_1m is the orthogonal linear contrast coefficient;
e_mij is the error associated with the repeated measures; and
σ² is the within-person variance.
Note that in the case of the linear model, the contrast coefficients are easily computed
using the formulas in equation 2. For example, if M = 5, the orthogonal contrast
coefficients for a first degree polynomial are:

c_0 = (1, 1, 1, 1, 1)   (6)
c_1 = (−2, −1, 0, 1, 2)
The level-2 model, or person-level model, is:

π_0ij = β_00j + r_0ij,   r_0ij ~ N(0, τ_π0),   (7)
π_1ij = β_10j + r_1ij,   r_1ij ~ N(0, τ_π1),

where β_00j is the mean response in cluster j;
β_10j is the average growth rate in cluster j;
r_0ij is the random effect associated with the mean response for person i in cluster j;
r_1ij is the random effect associated with the growth rate for person i in cluster j;
τ_π0 is the between-person variance in means; and
τ_π1 is the between-person variance in growth rates.
The level-3 model, or cluster-level model, is:

β_00j = γ_000 + γ_001 W_j + u_00j,   u_00j ~ N(0, τ_β0),   (8)
β_10j = γ_100 + γ_101 W_j + u_10j,   u_10j ~ N(0, τ_β1),

where γ_000 is the grand mean;
γ_001 is the main effect of treatment for the mean;
W_j is the treatment indicator, ½ for treatment and −½ for control;
γ_100 is the average growth rate;
γ_101 is the main effect of treatment for the growth rates;
u_00j is the random effect associated with the mean for each cluster;
u_10j is the random effect associated with the growth rate for each cluster;
τ_β0 is the between-cluster variance in means; and
τ_β1 is the between-cluster variance in growth rates.
Note that for a first degree polynomial, our primary interest is in growth rates, thus we
are interested in 101γ , the main effect of treatment on the growth rates, and in 1βτ , the
between-cluster variance in growth rates.
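To build intuition for equations 5 through 8, the sketch below simulates data from the linear model and recovers the treatment effect on growth rates, γ_101, as the difference in mean person-level OLS slopes between arms. It is an illustration only; the generator and all parameter values are our own assumptions, not part of the Optimal Design software.

```python
import random

# Minimal simulation of the linear three-level model (equations 5-8).
# All parameter values below are illustrative assumptions.
random.seed(1)

J, n, M = 200, 20, 6            # clusters, persons per cluster, occasions
g000, g001 = 0.0, 0.0           # grand mean, treatment effect on means
g100, g101 = 0.2, 0.5           # average growth rate, treatment effect on growth
tau_b0, tau_b1 = 0.1, 0.1       # between-cluster variances
tau_p0, tau_p1 = 0.9, 0.9       # between-person variances
sigma2 = 1.0                    # within-person variance

c1 = [m - (M + 1) / 2 for m in range(1, M + 1)]   # linear contrast c_1m
s2 = sum(c * c for c in c1)

slopes = {0.5: [], -0.5: []}    # person slopes by treatment arm
for j in range(J):
    W = 0.5 if j < J // 2 else -0.5
    b00 = g000 + g001 * W + random.gauss(0, tau_b0 ** 0.5)
    b10 = g100 + g101 * W + random.gauss(0, tau_b1 ** 0.5)
    for i in range(n):
        p0 = b00 + random.gauss(0, tau_p0 ** 0.5)
        p1 = b10 + random.gauss(0, tau_p1 ** 0.5)
        y = [p0 + p1 * c1[m] + random.gauss(0, sigma2 ** 0.5) for m in range(M)]
        # OLS estimate of this person's linear change parameter
        slopes[W].append(sum(c1[m] * y[m] for m in range(M)) / s2)

est = (sum(slopes[0.5]) / len(slopes[0.5])
       - sum(slopes[-0.5]) / len(slopes[-0.5]))
print(round(est, 2))  # should land near the true gamma_101 = 0.5
```

With 200 clusters the sampling standard deviation of the estimate (equation 10) is only about 0.05, so the recovered effect sits close to the true 0.5.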
10.3 Testing the Main Effect of Treatment
The average treatment effect for the pth polynomial order of change in our
balanced design is defined in level 3 of the model. It is estimated by:
γ̂_p01 = Ȳ_E − Ȳ_C.   (9)

Note that the estimated main effect of treatment looks like that in the cluster randomized
trial except that now we are averaging over occasions and persons. The variance of the
treatment effect for the pth polynomial order of change (Raudenbush and Liu 2001) is:
Var(γ̂_p01) = 4[τ_βp + (τ_πp + V_p)/n]/J,   (10)

V_p = σ²/Σ_{m=1}^{M} c_pm² = σ² f^{2p} (M − p − 1)!/[(p!)² K_p (M + p)!],

where f is the frequency of observation;
D is the duration of the study;
M is the total number of occasions, M = Df + 1;
p is the polynomial order of change; and
K_p is a constant, where K_1 = 1/12, K_2 = 1/720, and K_3 = 1/100,800.
Note that V_p denotes the conditional variance of the least squares estimate of each
participant's change parameter.
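As a sanity check on equation 10, the snippet below (a hypothetical helper, not from the Optimal Design software) compares σ²/Σc_pm², computed directly from the equation 2 contrasts for M = 5 and f = 1, with the closed form; the two agree for both the linear and quadratic cases.

```python
from math import factorial

def V_closed(p, M, f=1.0, sigma2=1.0):
    # closed form for V_p from equation 10
    K = {1: 1 / 12, 2: 1 / 720, 3: 1 / 100_800}[p]
    num = sigma2 * f ** (2 * p) * factorial(M - p - 1)
    return num / (factorial(p) ** 2 * K * factorial(M + p))

# direct computation from the M = 5 contrasts of equations 2 and 6
c1 = [-2.0, -1.0, 0.0, 1.0, 2.0]        # linear
c2 = [c * c - 2.0 for c in c1]          # quadratic: (2, -1, -2, -1, 2)
print(1.0 / sum(c * c for c in c1), V_closed(1, 5))  # both equal 1/10
print(1.0 / sum(c * c for c in c2), V_closed(2, 5))  # both equal 1/14
```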
We can translate the above formulas to a more concrete example in the case of a
first degree polynomial. For a first degree polynomial, the variance of the estimate of the
treatment effect is:
Var(γ̂_101) = 4[τ_β1 + (τ_π1 + V_1)/n]/J,   (11)

where

V_1 = σ²/Σ_{m=1}^{M} c_1m² = σ² f² (M − 2)!/[(1/12)(M + 1)!].   (12)
In the general case, we can use the following hypotheses to test the significance of
the main effect of treatment for the polynomial order of interest:
H_0: γ_p01 = 0
H_1: γ_p01 ≠ 0   (13)
When the null hypothesis is true, the test statistic is an F statistic and follows a central F
distribution, F(1, J-2). The test statistic is:
F = γ̂_p01² / Var(γ̂_p01).   (14)
When the alternative hypothesis is true, the test statistic remains the same but
follows a noncentral F distribution, F(1, J-2; λ ). Recall that the noncentrality parameter
is a ratio of the squared treatment effect to the variance of the treatment effect estimate.
The noncentrality parameter is:
λ = γ_p01² / Var(γ̂_p01) = J γ_p01² / (4[τ_βp + (τ_πp + V_p)/n]).   (15)
Recall that the larger the noncentrality parameter, the greater the power of the test.
Looking at the formula, we can see that J is the most influential sample size for
increasing the power. In other words, the number of clusters is more important than the
number of people within each cluster for increasing the power. It is particularly important to
have a large number of clusters if there is a lot of between-cluster variation, τ_βp. Also,
increasing the number of occasions, M, reduces the within-person variance, which
increases the power. Note that M is a function of f and D, where M=(fD+1) so increasing
the frequency of the observations or duration of the study increases M. Increasing n, the
number of people within each cluster, will also decrease the total within and between-
person variance, thus increasing the power. Finally, larger effect sizes increase the power
to detect a treatment effect.
Thus far, we have concentrated on the unstandardized model. However, similar to
cluster randomized trials, researchers typically use standardized models and effect sizes.
As discussed in Chapter 1, we will use Cohen’s rules of thumb for standardized effect
sizes, with 0.20, 0.50, and 0.80 as small, medium, and large effect sizes. Let’s see how
we translate the model to standardized notation.
The standardized effect size for a polynomial of order p is:
δ = γ_p01 / √(τ_βp + τ_πp),   (16)

where γ_p01 is the main effect for the polynomial order of change, and
τ_βp + τ_πp is the total between-cluster and between-person variance, denoted τ.
In words, δ is the group difference on the polynomial of interest divided by the standard
deviation for that polynomial, or the square root of the sum of the between-cluster
variance and the between-person variance for the specified polynomial. Similar to
standardized models we defined in previous chapters, we need to define ρ, the intra-class
correlation. The intra-class correlation, ρ, is:

ρ = τ_βp / (τ_βp + τ_πp),   (17)

where τ = τ_βp + τ_πp is the total between-cluster and within-cluster variance;
τ_βp is the between-cluster variance on the polynomial of interest; and
τ_πp is the within-cluster variance on the polynomial of interest.
Note that if τ = 1, then τ_βp = ρ and τ_πp = 1 − ρ, which is consistent with the intra-class
correlation for a cluster randomized trial. Also, ρ is a ratio of the between-cluster
variance to the total variance for a specific polynomial order of change. We can think of
ρ as partitioning the growth-rate variance into a between-cluster and a within-cluster
component.
Using the standardized effect size, δ, and ρ, and constraining τ = 1, we can rewrite
the variance of the treatment effect estimate as:

Var(γ̂_p01) = 4[ρ + (1 − ρ + V_p)/n]/J.   (18)

Another simplification involves rewriting the variance in terms of the reliability of the
person-specific polynomial change. The reliability is denoted α_p and is defined as:

α_p = τ_πp / (τ_πp + V_p).   (19)

Rewriting the variance in terms of the reliability, we get:

Var(γ̂_p01) = 4[ρ + (1 − ρ)/(α_p n)]/J.   (20)

We write the variance in this form because standard programs for hierarchical data often
give us an estimate of the person-specific reliability.
We can also rewrite the noncentrality parameter in terms of the standardized
notation. The new noncentrality parameter is:

λ = J δ² / (4[ρ + (1 − ρ)/(α_p n)]).   (21)

Note that the power is now a function of the number of clusters, J, the cluster size,
n, the standardized effect size, δ, the intra-class correlation, ρ, and the reliability, α_p,
which is a function of the between-person variance, τ_πp, the within-person variance, σ²,
the study duration, D, the frequency of the observations, f, and the number of occasions,
M. It is important to be familiar with the standardized notation because the Optimal
Design software operates with the standardized notation.
10.4 Examples

To illustrate the use of repeated measures in a cluster randomized trial, consider
the example below, which is a modification of the example introduced at the beginning of
the chapter. Imagine that researchers develop a new phonics program for first graders.
The program is a five-year program. Students are assessed one time each year.
Researchers are interested in the growth rate of students so they propose a linear model.
Forty classrooms have been randomly selected to participate in the study, 20 in the
treatment group and 20 in the control group. Each classroom has 25 students and all 25
will participate in the study. Based on a past 2-level repeated measures design, the
researchers estimate the within-person variance to be 1.0 and the overall variability in the
growth rates to be 1.0. They hypothesize ρ = 0.10 but would like to allow ρ to vary along
the x-axis. They want to detect a minimum effect size of 0.30. What is the power of the
test to detect the main effect of treatment for ρ = 0.10?
First let’s list the information that is given.
J = 40
n = 25
ρ = 0.10
δ = 0.30
M = 6
f = 1
D = 5
σ² = 1.0
τ = 1.0 (Note that this is τ_βp + τ_πp. We estimate how it is partitioned by ρ.)

Entering this information into the cluster randomized trials with repeated
measures option produces the graph in Figure 1.
Figure 1. Power vs. Intra-Class Correlation.
[Graph: power (y-axis) vs. intra-class correlation (x-axis); key settings: α = 0.050, F = 1,
D = 5, M = 6, σ² = 1.0, τ = 1.0, δ = 0.30, J = 40, n = 25.]
Note that ρ is allowed to vary along the x-axis as a function of the power. The key in the
upper right hand corner can be used to confirm the settings. Clicking along the trajectory
at ρ = 0.10 reveals that the power = 0.70. We can see that for smaller values of ρ, the
power is higher.
Another scenario a researcher might encounter in designing a cluster randomized
trial with repeated measures is described below. Suppose a researcher conducted a 3-level
pilot study about the phonics program described in the previous example. From the pilot
study, he estimates the within-person variation, σ² = 1.0, the between-person variation on
growth rates, τ_π1 = 0.90, and the between-cluster variation on growth rates, τ_β1 = 0.10.
The students are tested one time in each of five years and 40 schools are selected to
participate with 25 students in each school. What is the power to detect an effect size of
0.30?
First, we need to translate the information given into a form that is acceptable in
the Optimal Design program. We know σ² = 1.0, τ_π1 = 0.90, and τ_β1 = 0.10. Thus we
can calculate:

ρ = 0.10/(0.90 + 0.10) = 0.10 and τ = 0.10 + 0.90 = 1.0.
Note that all of the parameters are the same as in example 1 so we will get the same
results. The idea here is that although the information may appear different in its original
form, it is important to translate it into the correct parameters required by the program
before continuing further.
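The translation is plain arithmetic; a minimal sketch, using the values from the pilot-study example:

```python
# Translating pilot-study variance components into the OD parameters.
sigma2 = 1.0      # within-person variance
tau_pi1 = 0.90    # between-person variance in growth rates
tau_b1 = 0.10     # between-cluster variance in growth rates

rho = tau_b1 / (tau_b1 + tau_pi1)   # intra-class correlation, equation 17
tau = tau_b1 + tau_pi1              # total variance entered into OD
print(rho, tau)  # prints: 0.1 1.0
```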
11. Using the Optimal Design Software for Cluster Randomized Trials with Repeated
Measures
This chapter focuses on how to use the OD software to design a cluster
randomized trial with repeated measures. Section 11.1 provides general information
about how to use the cluster randomized trial with repeated measures option. Section 11.2
provides an example and details for using the software.
11.1 General Information
The cluster randomized trial with repeated measures option allows the researcher
to explore the power for the main effect of treatment as a function of the cluster size, n,
the number of clusters, J, the intra-class correlation, ρ , and the desired effect size, δ .
Below is the menu.
Cluster Randomized Trials Repeated Measures
Power vs. Cluster Size (n)
Power vs. Number of Clusters (J)
Power vs. Intra-class Correlation ( ρ )
Power vs. Effect Size (δ )
11.2.1 Example
For illustration purposes, we modify the example introduced in Chapter 10.
Imagine that a group of researchers develop a new phonics program for first graders. The
program is an intense year-long program. The researchers propose a repeated measures
design for students nested within schools. They plan to assess students at the beginning of
the year, prior to treatment, and then on six occasions throughout the year. Researchers
are interested in the growth rate of students so they propose a linear model. A past two-
level repeated measures design estimates the within-person variability to be 10.0 and the
overall variability in growth rates to be 1.0. They want to explore different designs to try
to achieve high power. Four scenarios the researcher might encounter are presented
below. Assume alpha = 0.05 for each case.
11.2.2 Scenario 1
The researchers hypothesize ρ = 0.05. In other words, 5% of the total variation in
growth rates is between-cluster variation. They want to detect a minimum effect size of
0.25. Assuming 40 schools are willing to participate in the study, how many students do
they need in each school to achieve power = 0.80?
In Scenario 1, the cluster size is unknown thus we select the power vs. cluster size
(n) option. Figure 1 displays the screen.
Figure 1. CRTRM - Power vs. Cluster Size (n).
Let’s take a closer look at the function of each of the buttons on the toolbar.
α - specifies the significance level, or chance of a Type I error. By default, α is set at
0.05, which is a common level for most designs.
J – specifies the number of clusters. By default, J is set at 20.
ρ - specifies the intra-class correlation. Recall it is defined as ρ = τ_βp/(τ_βp + τ_πp). By
default, ρ is set at 0.05 and 0.10.
δ - specifies the effect size. By default, δ is set at 0.40.
Set – Within the Set button, there are a variety of settings listed below.
F – specifies the frequency of observations.
D – specifies the duration of the study.
M – specifies the total number of observed occasions and is a function of F and D
where M=(FD+1).
Variability of level-1 residual, σ² - specifies the within-person variation.
Variability of level-1 coefficient, τ - specifies the sum of the between-person and
between-cluster variation, τ_πp + τ_βp.
Polynomial Order – allows the researcher to select either a linear, quadratic, or
cubic model.
The remaining options on the toolbar are the same as those in the Cluster randomized
trial option, which are explained in full detail in Chapter 2.
Now let’s use the software to explore the question in Scenario 1. To answer the
question, click on Power vs. cluster size (n). Then move along the toolbar and specify
J = 40, ρ = 0.05, and δ = 0.25. Click on the Set button and change D = 6. Note this makes
M = 7 because M = (DF+1). Set σ² to 10.0 and note that τ is set to 1.0, which matches our
design. The linear model selection is already checked which also matches the design.
Figure 2 displays the results.
Figure 2. CRTRM - Power vs. Cluster Size (n).
Note that the key in the upper right corner reflects the settings we identified. The cluster
size is allowed to vary along the x-axis as a function of the power. Clicking along the
trajectory reveals that 50 students per school are required to achieve power = 0.80. Note
that the power does not converge to 1 as the sample size per cluster is increased.
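Scenario 1 can also be checked numerically. The sketch below is a hypothetical re-implementation of the Chapter 10 standardized formulas (equations 19 through 21), not the Optimal Design code itself; with these settings it gives a power of roughly 0.80 at n = 50, consistent with the Figure 2 trajectory, and shows the plateau below 1 as n grows.

```python
from math import factorial
from scipy.stats import f as f_dist, ncf

# Hypothetical check of Scenario 1: power as a function of cluster size n,
# using the standardized formulas of Chapter 10 (tau = 1 assumed).
def power(n, J=40, rho=0.05, delta=0.25, sigma2=10.0, freq=1.0, D=6, alpha=0.05):
    M = int(freq * D + 1)                               # M = 7 occasions
    V1 = sigma2 * freq ** 2 * factorial(M - 2) / ((1 / 12) * factorial(M + 1))
    rel = (1 - rho) / ((1 - rho) + V1)                  # reliability alpha_1
    lam = J * delta ** 2 / (4 * (rho + (1 - rho) / (rel * n)))
    crit = f_dist.ppf(1 - alpha, 1, J - 2)
    return 1 - ncf.cdf(crit, 1, J - 2, lam)

print(round(power(50), 2))      # close to the 0.80 read off the graph
print(round(power(10_000), 2))  # power plateaus below 1 as n grows
```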
11.2.3 Scenario 2
The researchers hypothesize ρ = 0.05. Again, 5% of the total variation in growth
rates is between-cluster variation. They want to detect a minimum effect size of 0.25.
Assuming 25 students are willing to participate in each school, how many schools do
they need to achieve power = 0.80?
In Scenario 2, the number of clusters is unknown thus we select the power vs. number
of clusters (J) option. This allows the number of clusters to vary along the x-axis. As a
result, J will be replaced on the toolbar by the cluster size (n) icon. The remaining options
function as previously described. Now let’s use the software to explore the question in
Scenario 2.
To answer the question, click on Power vs. number of clusters (J). Then move
along the toolbar and specify n = 25, ρ = 0.05, and δ = 0.25. Click on the Set button and
change D = 6. Note this makes M = 7 because M = (DF+1). Set σ² to 10.0 and note τ is set to
1.0 which matches our design. The linear model selection is already checked which also
matches the design. Figure 3 displays the results.
Figure 3. CRTRM - Power vs. Number of Clusters (J).
Clicking along the trajectory reveals that 54 clusters are necessary to achieve power =
0.80. Note that unlike the cluster size, as the number of clusters increases, the power
tends towards 1.0. This corresponds to the information presented in Chapter 7, which
states that the number of clusters is more influential on power than cluster size.
11.2.4 Scenario 3
The researchers want to detect a minimum effect size of 0.25. Assume they are
able to secure 40 schools and 25 students within each school. What value of ρ achieves
power = 0.80? What does this value of ρ mean?
In Scenario 3, the intra-class correlation, ρ , is unknown thus we select the power
vs. intra-class correlation option. Again, the only change in the toolbar is that the ρ no
longer appears since it is allowed to vary along the x-axis. This is a very useful option for
the case where the overall variability in the polynomial order of change is estimated from
a 2-level repeated measures design, but researchers are unclear about how the variability
is partitioned into between-person and between-cluster variability. This option allows the
researchers to calculate the power for different values of ρ . Let’s explore Scenario 3 to
get a better idea of this option.
To answer the question, click on Power vs. intra-class correlation ( ρ ). Then
move along the toolbar and specify J = 40, n = 25, and δ = 0.25. Click on the Set button and
change D = 6. Note this makes M = 7 because M = (DF+1). Set σ² to 10.0 and note τ is set to
1.0 which matches our design. The linear model selection is already checked which also
matches the design. Figure 4 displays the results.
Figure 4. CRTRM - Power vs. Intra-class Correlation (ρ).
Clicking along the trajectory reveals that ρ = 0.02 results in power = 0.80. This means
that only 2% of the overall variability in growth rates can be attributed to the between-
cluster variation. Note that as the intra-class correlation increases, or more of the
variability is between clusters, the power of the test decreases.
11.2.5 Scenario 4
The researchers hypothesize ρ = 0.05. Again, 5% of the total variation in growth
rates is between-cluster variation. Assume that they are able to secure 40 schools and 25
students within each school. What is the minimum effect size they can detect with power
= 0.80?
In Scenario 4, the effect size is unknown thus we select the power vs. effect size (δ)
option. Again, the only change in the toolbar is that δ no longer appears since it is
allowed to vary along the x-axis. Let’s explore Scenario 4 using the software.
To answer the question, click on Power vs. effect size (δ ). Then move along the
toolbar and specify J = 40, n = 25, and ρ = 0.05. Click on the Set button and change D = 6.
Note this makes M = 7 because M = (DF+1). Set σ² to 10.0 and note τ is set to 1.0 which
matches our design. The linear model selection is already checked which also matches
the design. Figure 5 displays the results.
Figure 5. CRTRM - Power vs. Effect Size (δ ).
Clicking along the trajectory reveals that δ = 0.29 results in power = 0.80. Note that as
the effect size increases, the power also increases, which is consistent with the
information in Chapter 7.
References

Kirk, Roger E. 1982. Experimental Design: Procedures for the Behavioral Sciences. 2nd ed. Belmont, CA: Brooks/Cole.
Raudenbush, Stephen W. 1997. Statistical Analysis and Optimal Design for Cluster Randomized Trials. Psychological Methods 2 (2): 173-185.
Raudenbush, Stephen W., and Anthony S. Bryk. 2002. Hierarchical Linear Models: Applications and Data Analysis Methods. 2nd ed. Thousand Oaks, CA: Sage Publications.
Raudenbush, Stephen W., and Xiaofeng Liu. 2000. Statistical Power and Optimal Design for Multisite Randomized Trials. Psychological Methods 5 (2): 199-213.
———. 2001. Effects of Study Duration, Frequency of Observation, and Sample Size on Power in Studies of Group Differences in Polynomial Change. Psychological Methods 6 (4): 387-401.