
14 Graphical Approaches to Multiple Testing

Frank Bretz, Willi Maurer, and Jeff Maca

In: Clinical Trial Biostatistics and Biopharmaceutical Applications

CONTENTS

14.1 Introduction
14.2 Case Studies
    14.2.1 Comparing Two Doses with a Control for Two Hierarchical Endpoints
    14.2.2 Test Strategies for Composite Endpoints and Their Components
    14.2.3 Testing Noninferiority and Superiority for Multiple Endpoints in a Combination Trial
14.3 Main Approach
    14.3.1 Bonferroni-Based Graphical Test Procedures
        14.3.1.1 Heuristics
        14.3.1.2 Graphical Approach to Multiple Testing
        14.3.1.3 Adjusted p-Values
        14.3.1.4 Simultaneous Confidence Intervals
        14.3.1.5 Power and Sample Size Calculation
    14.3.2 Case Studies Revisited
        14.3.2.1 Comparing Two Doses with a Control for Two Hierarchical Endpoints
        14.3.2.2 Test Strategies for Composite Endpoints and Their Components
        14.3.2.3 Testing Noninferiority and Superiority for Multiple Endpoints in a Combination Trial
    14.3.3 Software Implementations
        14.3.3.1 SAS
        14.3.3.2 R
    14.3.4 Graphical Visualization of Common Multiple Test Procedures
    14.3.5 Technical Background
14.4 Extensions
    14.4.1 Parametric Graphical Test Procedures
    14.4.2 Simes-Based Graphical Test Procedures
    14.4.3 Graphical Approaches for Group Sequential Designs
    14.4.4 Graphical Approaches for Families of Hypotheses
    14.4.5 Entangled Graphs
    14.4.6 Graphical Approaches for k-out-of-m Gatekeeper Problems
14.5 Conclusions
References

14.1 Introduction

Regulatory guidelines for drug development suggest strong control of the familywise error rate (FWER) when multiple hypotheses are tested simultaneously in confirmatory clinical trials (ICH, 1998; CHMP, 2002). That is, the probability of erroneously rejecting at least one true null hypothesis is controlled at a prespecified significance level α ∈ (0, 1) under any configuration of true and false null hypotheses. A variety of multiple test procedures exist that control the FWER at the designated level α, and the underlying theory is well developed (Dmitrienko et al., 2009; Bretz et al., 2010; Westfall et al., 2011). However, confirmatory studies are becoming increasingly complex and often involve multiple statistical hypotheses that reflect structured clinical study objectives. Typical examples include the simultaneous investigation of multiple doses or regimens of a new treatment, two or more clinical endpoints, several populations, noninferiority and superiority, or any combination thereof. Clinical teams are then faced with the difficult task of structuring these hypotheses to best reflect the clinical study’s objectives. This task comprises, but is not restricted to, the identification of the study’s primary objective(s), its secondary objective(s), a decision about whether only a single hypothesis is of paramount importance or several of them are equally relevant, the degree of controlling incorrect decisions, etc. In addition, pairs of primary and secondary objectives might be coupled and should thus be investigated hierarchically. For example, in a diabetes trial, a reduction in the patients’ body weight may only be of interest if a reduction in the glycated hemoglobin (HbA1c) level is achieved, but two different doses of a treatment are equally relevant contenders for a dosage recommendation of a specific drug. In this case, the hypothesis involving body weight reduction in the low dose is a descendant of its parent primary hypothesis (HbA1c level reduction in the low dose).

A variety of standard multiple test procedures are available, such as those by Bonferroni, Holm, Hochberg, and Dunnett. However, these procedures are often not suitable for the advanced structured hypothesis test problems mentioned earlier, because they treat all hypotheses equally and do not address the underlying structure of the test problem. At the same time, great care is advised when using ad hoc extensions of standard test procedures, as they may not control the FWER at the designated level α.


For example, consider a clinical trial comparing two doses of a new compound with placebo for a primary and a secondary endpoint, resulting in four null hypotheses of interest, Hi, i = 1, ..., 4. Let H1, H2 denote the two primary hypotheses (both dose-placebo comparisons for the primary endpoint) and H3, H4 the two secondary hypotheses (both dose-placebo comparisons for the secondary endpoint). Consider the following intuitive extension of the Holm procedure: Test H1 and H2 with the Holm procedure at level α; if at least one hypothesis is rejected, test the descendant secondary hypothesis at level α/2. This procedure (and many variants thereof) does not control the FWER at level α. The actual FWER can be up to 3α/2 in this particular example.
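The inflation can be made concrete with a small Monte Carlo sketch. It is only an illustration under assumptions that are not part of the text: H1 is taken to be false with an overwhelming effect (p1 = 0), while H2, H3, and H4 are true with independent, uniformly distributed p-values. The function names `holm_two`, `naive_extension`, and `simulate_fwer` are hypothetical. Under independence the FWER equals 1 − (1 − α)(1 − α/2), slightly below the bound 3α/2 but clearly above α.

```python
import random


def holm_two(p1, p2, alpha):
    """Holm procedure for two hypotheses; returns (reject_H1, reject_H2)."""
    lo, hi = min(p1, p2), max(p1, p2)
    if lo > alpha / 2:
        return False, False
    both = hi <= alpha
    # The smaller p-value is rejected; the larger one only if it is <= alpha.
    return (p1 <= p2 or both), (p2 < p1 or both)


def naive_extension(p, alpha):
    """Naive rule: Holm on H1, H2; each rejected primary hypothesis unlocks
    its descendant (H3 for H1, H4 for H2) at level alpha / 2."""
    r1, r2 = holm_two(p[0], p[1], alpha)
    r3 = r1 and p[2] <= alpha / 2
    r4 = r2 and p[3] <= alpha / 2
    return [r1, r2, r3, r4]


def simulate_fwer(alpha=0.025, n_sim=200_000, seed=1):
    """Estimate the FWER under the assumed configuration."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_sim):
        # Assumption: H1 false with p1 = 0; H2, H3, H4 true and independent.
        p = [0.0, rng.random(), rng.random(), rng.random()]
        rej = naive_extension(p, alpha)
        if rej[1] or rej[2] or rej[3]:  # any true null hypothesis rejected
            errors += 1
    return errors / n_sim


alpha = 0.025
exact = 1 - (1 - alpha) * (1 - alpha / 2)  # closed form under independence
estimate = simulate_fwer(alpha)
print(f"nominal level: {alpha}, exact FWER: {exact:.5f}, simulated: {estimate:.5f}")
```

The bound 3α/2 itself would be attained only if the rejection events for H2 and H3 were disjoint, which the independence assumption used here rules out.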

In view of the increasing complexity of confirmatory study designs and objectives, new classes of multiple test procedures have been developed in the past years, such as fixed sequence, fallback, and gatekeeping procedures; see Alosh et al. (2014) for a recent review of advanced multiple test procedures applied to clinical trials. Such procedures reflect the difference in importance as well as the relationship between the various study objectives; see Hommel et al. (2007); Guilbaud (2007); Dmitrienko et al. (2008); Li and Mehrotra (2008); Alosh and Huque (2010); Dmitrienko and Tamhane (2011); Dmitrienko et al. (2011); Kim et al. (2011); Luo et al. (2013), among many others. In this chapter, we focus on the graphical approaches proposed by Bretz et al. (2009) and Burman et al. (2009) to construct, visualize, and perform multiple test procedures that are tailored to the structured families of hypotheses of interest. In these approaches, vertices with associated weights denote the individual null hypotheses and their local significance levels. Directed edges between the vertices specify how the local significance levels are propagated in case of significant results. The resulting procedures control the FWER in the strong sense at the designated level α across all hypotheses. Many standard multiple test procedures, including some of the recently developed gatekeeping procedures, can be visualized and performed intuitively using graphical approaches.

In the meantime, graphical methods have been applied to different test problems, such as combined noninferiority and superiority testing (Hung and Wang, 2010; Guilbaud, 2011; Lawrence, 2011), testing of composite endpoints and their components (Huque et al., 2011; Rauch and Beyersmann, 2013), and subgroup analyses (Bretz et al., 2011a). The description of these methods has mostly focused on Bonferroni-based test procedures, although extensions have been proposed that include weighted Simes’ or parametric tests (Bretz et al., 2011b; Maurer et al., 2011; Millen and Dmitrienko, 2011). Further methodological extensions have been described for group sequential trials (Maurer and Bretz, 2013b), adaptive designs (Sugitani et al., 2013, 2014; Klinglmueller et al., 2014), families of hypotheses (Kordzakhia and Dmitrienko, 2013; Maurer and Bretz, 2014), and entangled graphical test procedures (Maurer and Bretz, 2013a). Power and sample size considerations to optimize a graphical multiple test procedure for given study objectives were given in Bretz et al. (2011a). Software solutions in SAS and R were described in Bretz et al. (2011a,b). In the following sections, we describe in detail the graphical approach and illustrate it with the visualization of several common gatekeeping strategies. We also present several case studies to illustrate how the approach can be used in practice. We illustrate the methods using the graphical user interface (GUI) from the gMCP package in R (Rohmeyer and Klinglmueller, 2014), which is freely available on the Comprehensive R Archive Network (CRAN).

14.2 Case Studies

In this section, we introduce three case studies that will be revisited and extended later to illustrate some of the methods described in the sequel. These case studies motivate the need for advanced methods to address multiplicity issues in clinical trials.

14.2.1 Comparing Two Doses with a Control for Two Hierarchical Endpoints

Consider a diabetes trial comparing two doses (low and high) against placebo for two hierarchically ordered endpoints (HbA1c level and body weight), resulting in two levels of multiplicity and four null hypotheses H1, H2, H3, and H4. In addition to the given family of null hypotheses, clinical considerations often lead to a structured hypothesis test problem subject to certain logical constraints. Assume for our example that HbA1c is more important than body weight. Thus, the four hypotheses are grouped into two primary hypotheses H1, H2 (both dose-placebo comparisons for HbA1c) and two secondary hypotheses H3, H4 (both dose-placebo comparisons for body weight). Both doses are considered equally important, which rules out a full hierarchy of testing first the high dose and, conditional on its significance, then the low dose. In addition, it is required that a secondary hypothesis is not tested without having rejected the associated primary hypothesis (successiveness property; see Maurer et al. 2011; O’Neill 1997). That is, we consider {H1, H3} and {H2, H4} as pairs of parent–descendant hypotheses to reflect the hierarchy among the two endpoints within the same dose. The objective is to test all four hypotheses under strong FWER control while reflecting the clinical considerations mentioned above and without leading to illogical decisions (Hung and Wang, 2009, 2010). Standard multiple comparison procedures, such as those by Bonferroni, Holm, or Dunnett, are not suitable here, because they treat all four hypotheses equally and do not address the underlying structure of the test problem. Instead, one needs to construct test strategies that reflect the complex clinical requirements on the structured hypotheses.


14.2.2 Test Strategies for Composite Endpoints and Their Components

Composite endpoints are defined as a collection of (usually) low-incidence outcomes of interest and are often analyzed as time to first occurrence of any individual component. They have to be distinguished from multicomponent endpoints, where the individual components are not of interest per se and are instead used to assign a score or a responder status to each subject, such as sum scores (e.g., the positive and negative syndrome scale [PANSS] for schizophrenia) or responder definitions (e.g., the American College of Rheumatology scores [ACR-N] to measure change in rheumatoid arthritis symptoms).

One common area of application is cardiovascular (CV) trials, where the composite endpoints consist of events of different types such as CV mortality (CV death), stroke, myocardial infarction, and hospitalization for urgent revascularization procedures. The use of such composite endpoints may lead to a reduction of trial size and duration, avoiding at first glance the multiplicity issue arising from testing the individual outcomes. However, interpretation of study findings based on a composite endpoint can be problematic. This is particularly true if a statistically significant treatment effect for the composite endpoint was driven mainly by a soft endpoint, such as hospitalization, and treatment effects for the hard endpoints are small or even in the opposite direction to that for the composite endpoint. Consequently, for an appropriate interpretation of study results, it is important to analyze the treatment effects for the individual components.

The analysis of individual components is merely descriptive if it is based on reporting mean responses, nominal confidence intervals and p-values, or forest plots, together with the findings for the composite endpoint. In this case, no formal claim is intended for the component endpoints. In contrast, it is sometimes of interest to establish an efficacy claim for a key individual component, such as CV death, by testing its corresponding hypothesis, say H2, after the composite endpoint hypothesis, say H1, is either rejected or has narrowly missed the significance level α. In such cases, multiplicity needs to be formally taken into account. Standard multiple test procedures may again not be appropriate, as they treat both hypotheses H1 and H2 as equally important and test them individually regardless of the findings for the other hypothesis. We revisit this case study later and discuss alternative test strategies that can be employed, depending on the underlying trial objectives.

14.2.3 Testing Noninferiority and Superiority for Multiple Endpoints in a Combination Trial

The aim of the Aliskiren Trial of Minimizing OutcomeS for Patients with HEart failuRE (ATMOSPHERE) study (Krum et al., 2011) is to evaluate the effect of both aliskiren and enalapril monotherapy and aliskiren/enalapril combination therapy on CV death and heart failure hospitalization in patients with chronic systolic heart failure, New York Heart Association (NYHA) functional class II–IV symptoms, and elevated plasma levels of B-type natriuretic peptide (BNP).

Patients tolerant to at least 10 mg or equivalent of enalapril will undergo an open-label run-in period where they receive enalapril, then aliskiren. Approximately 7000 patients tolerating this run-in period will then be randomized 1:1:1 to aliskiren monotherapy, enalapril monotherapy, or the combination. The primary objectives of ATMOSPHERE are to investigate whether (1) the aliskiren/enalapril combination is superior to enalapril monotherapy in delaying time to first occurrence of CV death or heart failure hospitalization and (2) aliskiren monotherapy is superior or at least noninferior to enalapril monotherapy on this endpoint. The secondary objectives are to evaluate whether aliskiren monotherapy and/or the combination of aliskiren/enalapril is superior to enalapril monotherapy in (1) reducing the BNP level from baseline to 4 months and (2) improving the clinical summary score as assessed by the Kansas City Cardiomyopathy Questionnaire from baseline to 12 months. Other efficacy objectives are analyzed in an exploratory manner. Safety data will be collected during the ongoing trial and be part of the overall assessment.

This is a trial with complex and highly structured objectives. For the primary comparison between the combination of aliskiren and enalapril with enalapril monotherapy, a superiority test will be performed. Further comparisons between aliskiren and enalapril monotherapy include both superiority and noninferiority assessments and will be formally investigated, together with the additional secondary hypotheses. We will illustrate how to use the graphical approach to visualize a suitable test strategy when revisiting this case study.

14.3 Main Approach

In this section, we introduce the core graphical approach, which can be used to construct new and extend existing multiple test procedures. More specifically, we describe the graphical approach to Bonferroni-based sequentially rejective multiple test procedures from Bretz et al. (2009).

14.3.1 Bonferroni-Based Graphical Test Procedures

14.3.1.1 Heuristics

Assume that we are interested in testing m elementary null hypotheses H1, ..., Hm, which may include primary, secondary, or any other hypotheses of interest. Let 0 ≤ αi ≤ α, i ∈ I = {1, ..., m}, denote the local significance levels. That is, the overall significance level α is split across the m hypotheses such that $\sum_{i=1}^{m} \alpha_i \le \alpha$. Finally, let pi denote the unadjusted p-value for hypothesis Hi, i ∈ I.

Any multiple test procedure for H1, ..., Hm should guarantee strong FWER control at level α, be tailored to the structured trial objectives (as reflected through the elementary hypotheses and their relationships), and have good power. To this end, consider the following heuristic approach. Test the m null hypotheses, each at its local significance level αi, i ∈ I. If a hypothesis Hi can be rejected at level αi (i.e., pi ≤ αi), propagate its local level αi to the remaining, not yet rejected hypotheses according to a prespecified rule. Continue testing the remaining hypotheses with the updated local significance levels, thus possibly leading to further rejections with subsequent further propagation of the local levels. This procedure is repeated until no further hypothesis can be rejected. In Section 14.3.1.2, we show that, after a suitable formalization, the resulting sequentially rejective multiple test procedures indeed control the FWER strongly at level α.

We can visualize the resulting multiple test procedures using the following conventions; see Figure 14.1, which also includes an example for m = 2 hypotheses. The m null hypotheses are represented as m weighted nodes, where the weights are given by the local significance levels αi, i ∈ I. The α-propagation rules are determined through weighted, directed edges: The weight associated with a directed edge between any two nodes indicates the fraction of the local significance level at the initial node (tail) that is added to the significance level at the terminal node (head) if the hypothesis at the tail is rejected. We will illustrate the resulting graphical test procedures by revisiting in Section 14.3.2 the case studies from Section 14.2 and by visualizing a variety of common multiple test procedures in Section 14.3.4.

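[Figure 14.1, three rows. Top: hypotheses H1, ..., Hm represented as nodes (example: nodes H1 and H2). Middle: split of the significance level α into weights α1, ..., αm (example: α1 = α/2, α2 = α/2, no edges). Bottom: α-propagation through weighted, directed edges (example: α1 = α2 = α/2 with edges H1 → H2 and H2 → H1, each of weight 1).]

FIGURE 14.1 Conventions for the graphical approach. Right column: Example with m = 2 null hypotheses.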


[Figure 14.2, two panels. (a) Initial graph: H1 with α1 = 0.0125, p1 = 0.04 and H2 with α2 = 0.0125, p2 = 0.01; edges H1 → H2 and H2 → H1, each of weight 1. (b) H2 rejected: H1 with α1 = 0.025, p1 = 0.04.]

FIGURE 14.2 Numerical example of the graphical Holm procedure with m = 2 hypotheses and α = 0.025. (a) Initial graph. (b) Updated graph after rejecting H2.

As a matter of fact, Figure 14.1 already displays two common multiple test procedures for m = 2 hypotheses. The middle graph visualizes the Bonferroni procedure: Each of the two hypotheses is tested at level α/2; if one of them is rejected (i.e., pi ≤ αi = α/2 for at least one i ∈ {1, 2}), the other one continues being tested at level α/2, since there are no edges connecting the two nodes and therefore no α-propagation is foreseen. If α1 ≠ α2, we obtain the weighted Bonferroni test for m = 2. In the bottom graph, however, both nodes are connected and if H2 (say) is rejected at level α/2, its local level is propagated to H1 along the outgoing edge with weight 1. Consequently, H1 is tested at the updated level α/2 + α/2 = α. This is exactly the Holm procedure, which for m = 2 rejects the null hypothesis with the smaller p-value if it is at most α/2 and continues testing the other hypothesis at level α. Figure 14.2 provides a numerical example of the graphical Holm procedure with m = 2 hypotheses, α = 0.025, and unadjusted p-values p1 = 0.04, p2 = 0.01 for H1, H2, respectively. A numerical example of the graphical Holm procedure with m = 3 hypotheses, together with the visualization of the updated graphs after each rejection, is given in Bretz et al. (2009).
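For comparison, the classical step-down form of the Holm procedure can be sketched in a few lines. This is the standard ordered-p-value formulation, not the graph-based algorithm itself, and the function name `holm` is a hypothetical choice. Applied to the numbers of Figure 14.2, it reproduces the decision to reject H2 and retain H1.

```python
def holm(p_values, alpha):
    """Holm step-down procedure; returns one rejection flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    reject = [False] * m
    for rank, i in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared with alpha / (m - rank).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first hypothesis that cannot be rejected
    return reject


# Figure 14.2: alpha = 0.025, p1 = 0.04, p2 = 0.01
print(holm([0.04, 0.01], alpha=0.025))  # [False, True]

# With p1 = 0.02, Holm also rejects H1 at the updated level alpha,
# whereas the Bonferroni test (level alpha / 2 = 0.0125) would retain it.
print(holm([0.02, 0.01], alpha=0.025))  # [True, True]
```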

14.3.1.2 Graphical Approach to Multiple Testing

We now formalize the heuristic approach from Section 14.3.1.1 and state the main result. Let $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)$ denote the vector of local significance levels, such that $\sum_{i=1}^{m} \alpha_i \le \alpha$. Let $G = (g_{ij})$ denote an m × m transition matrix with freely chosen entries gij. The transition weight gij determines the fraction of the local level αi that is allocated to Hj in case Hi was rejected. The transition matrix G thus fully determines the directed edges in a graph. We require the transition weights to satisfy the regularity conditions

$$0 \le g_{ij} \le 1, \quad g_{ii} = 0, \quad \text{and} \quad \sum_{k=1}^{m} g_{ik} \le 1 \quad \text{for all } i, j = 1, \ldots, m. \tag{14.1}$$

That is, the transition weights should be nonnegative, the sum of the transition weights with tail on the same node is bounded by 1, and there are no elementary loops (edges where head and tail coincide). Based on the observed unadjusted p-values pi, i ∈ I = {1, ..., m}, we then define a sequentially rejective test procedure through the following algorithm.

Algorithm 14.1 (Weighted Bonferroni tests)

0. Set I = {1, 2, ..., m}.
1. Select a j ∈ I such that pj ≤ αj and reject Hj; otherwise stop.
2. Update the graph:

$$I \to I \setminus \{j\}$$

$$\alpha_\ell \to \begin{cases} \alpha_\ell + \alpha_j g_{j\ell}, & \text{for } \ell \in I, \\ 0, & \text{otherwise,} \end{cases}$$

$$g_{\ell k} \to \begin{cases} \dfrac{g_{\ell k} + g_{\ell j} g_{jk}}{1 - g_{\ell j} g_{j\ell}}, & \text{for } \ell, k \in I,\ \ell \ne k,\ g_{\ell j} g_{j\ell} < 1, \\ 0, & \text{otherwise.} \end{cases}$$

3. If |I| ≥ 1, go to Step 1; otherwise stop.

The initial levels $\boldsymbol{\alpha}$, the transition matrix G, and Algorithm 14.1 define a unique sequentially rejective test procedure that controls the FWER strongly at level α (Bretz et al., 2009). The proof of this statement uses the fact that the graph $(\boldsymbol{\alpha}, G)$ and Algorithm 14.1 define a closed test procedure with weighted Bonferroni tests for each intersection hypothesis. Moreover, the updated significance levels generated by Algorithm 14.1 fulfill a mild monotonicity condition that enables the construction of shortcuts for the resulting consonant closed test procedures. For the interested reader, we provide some technical background and relevant references in Section 14.3.5.

Note that sometimes several hypotheses Hi with pi ≤ αi could be rejected at the same iteration. Step 1 of Algorithm 14.1 does not specify how to select j in such cases. As a matter of fact, the resulting final set of rejected hypotheses is independent of how the index j is chosen. Thus, for all practical purposes, setting $j = \arg\min_{i \in I} p_i / \alpha_i$ in Step 1 of Algorithm 14.1 is a convenient solution but can be replaced by any other selection rule.

To illustrate the connection between Algorithm 14.1 and the proposed iterative graphs, consider Figure 14.3 for an example with m = 3 hypotheses. For the sake of concreteness, we assume that H1, H2 are two primary hypotheses (such as comparing two doses with a control for a primary endpoint) and H3 a single secondary hypothesis (such as comparing the pooled data from both doses with a control for a secondary endpoint). For the left graph in Figure 14.3, we have $\boldsymbol{\alpha} = \left(\tfrac{\alpha}{2}, \tfrac{\alpha}{2}, 0\right)$ and

$$G = \begin{pmatrix} 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ 0 & 0 & 0 \end{pmatrix}.$$


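[Figure 14.3, three panels. Left, initial graph: H1 (level α/2), H2 (level α/2), H3 (level 0); edges H1 → H2, H2 → H1, H1 → H3, and H2 → H3, each of weight 1/2. Middle, H2 rejected: H1 (level 3α/4), H3 (level α/4); edge H1 → H3 of weight 1. Right, H1 and H2 rejected: H3 (level α).]

FIGURE 14.3 Example of a graphical multiple test procedure to illustrate Algorithm 14.1 with α = 0.025 and unadjusted p-values p1 = 0.015, p2 = 0.001, and p3 = 0.1.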

Assume α = 0.025 and that the unadjusted p-values p1 = 0.015, p2 = 0.001, and p3 = 0.1 have been observed. Then, p2 = 0.001 < 0.0125 = α/2 = α2, so that j = 2 and we can reject H2 according to Step 1 of Algorithm 14.1. Applying the graph iteratively, node H2 is deleted and the associated significance level α2 is propagated along the edges with tail on node H2. In our example, the associated transition weights are $g_{2\ell} = \tfrac{1}{2}$ for ℓ = 1, 3 and the updated vector of significance levels becomes $\boldsymbol{\alpha} = \left(\tfrac{\alpha}{2} + \tfrac{\alpha}{4}, 0, \tfrac{\alpha}{4}\right)$ for α = 0.025. At the same time, loose edges are reconnected and their weights renormalized to satisfy the regularity conditions (14.1), ultimately leading to the middle graph in Figure 14.3. These updates in the graph are essentially reflected and formalized in Step 2 of Algorithm 14.1. That is, the transition weight for the edge connecting the two remaining hypotheses H1 and H3 becomes

$$g_{13} = \frac{\tfrac{1}{2} + \tfrac{1}{2} \cdot \tfrac{1}{2}}{1 - \tfrac{1}{2} \cdot \tfrac{1}{2}} = 1.$$

Now, we return to Step 1 and test H1 and H3 at the updated local significance levels. Since p1 = 0.015 < 0.01875 = 3α/4 = α1, we have j = 1 and can reject H1. After updating the graph again, we obtain the right graph in Figure 14.3 and H3 is tested at level α. Since p3 = 0.1 > 0.025 = α = α3, we retain H3 and the procedure stops as no further rejection is possible.
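The walk-through above can be reproduced with a short sketch of Algorithm 14.1. It is meant as an illustration rather than a reference implementation: the function name `graphical_test`, the use of plain Python lists, and the tie-breaking rule (smallest p-value among the rejectable hypotheses) are choices made here, and the text notes that any selection rule leads to the same final decisions.

```python
def graphical_test(p, alpha_vec, G):
    """Sequentially rejective graphical procedure (sketch of Algorithm 14.1).

    p         : unadjusted p-values
    alpha_vec : initial local significance levels (summing to at most alpha)
    G         : m x m transition matrix satisfying conditions (14.1)
    Returns the rejection flags and the final local significance levels.
    """
    m = len(p)
    a = list(alpha_vec)
    g = [list(row) for row in G]
    # Check the regularity conditions (14.1).
    for i in range(m):
        if g[i][i] != 0 or any(w < 0 or w > 1 for w in g[i]) or sum(g[i]) > 1 + 1e-12:
            raise ValueError("transition matrix violates conditions (14.1)")
    active = set(range(m))
    rejected = [False] * m
    while active:
        candidates = [i for i in active if p[i] <= a[i]]  # Step 1
        if not candidates:
            break
        j = min(candidates, key=lambda i: p[i])  # any candidate would do
        rejected[j] = True
        active.remove(j)  # Step 2: I -> I \ {j}
        # Propagate the level of H_j along its outgoing edges.
        for l in active:
            a[l] += a[j] * g[j][l]
        a[j] = 0.0
        # Reconnect the edges that passed through the removed node.
        new_g = [[0.0] * m for _ in range(m)]
        for l in active:
            for k in active:
                denom = 1 - g[l][j] * g[j][l]
                if l != k and denom > 0:
                    new_g[l][k] = (g[l][k] + g[l][j] * g[j][k]) / denom
        g = new_g
    return rejected, a


alpha = 0.025
G = [[0, 0.5, 0.5],
     [0.5, 0, 0.5],
     [0, 0, 0]]
rejected, levels = graphical_test([0.015, 0.001, 0.1], [alpha / 2, alpha / 2, 0.0], G)
print(rejected)  # [True, True, False]
```

After both rejections the only remaining local level is that of H3, which equals α up to floating-point rounding.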

We conclude this example with two remarks. First, assume that both H1 and H2 could be rejected in the first iteration, for example, p1 = p2 = 0.001 < α/2. As mentioned before, Step 1 of Algorithm 14.1 does not specify how to select j. The test decisions remain the same, whether one first rejects H1 and proceeds with updating the graph, or starts in the reversed order by first rejecting H2. In Figure 14.3, if one had decided to first remove the node for H1, the updated graph would look different from the middle graph. However, the proof in Bretz et al. (2009) ensures that the final test decisions on all three hypotheses remain the same, regardless of the rejection sequence. Second, assume the unadjusted p-values p1 = 0.02, p2 = 0.001, and p3 = 0.005. Then we could reject both H2 and H3 (in this sequence). The last remaining hypothesis H1 would then be tested at level α1 = 3α/4, instead of α1 = α. The reason that the level is not exhausted is that there are no directed edges connecting H3 with either H1 or H2. That is, the third row of G has only 0 elements and once H3 is rejected, its level is not further propagated. Figure 14.3 is thus an example of a graph that can be improved immediately (in this case, by inserting edges with tail on H3). While only a few results are available on optimally selecting the weights for multiple test procedures (Westfall and Krishen, 2001), a sufficient condition for a graph being complete (in the sense that it cannot be improved by adding additional edges) is that the weights of outgoing edges sum to 1 at each node and every node is accessible from any of the other nodes. If αi > 0, i = 1, . . . , k, this is also a necessary condition for completeness.
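The sufficient condition for completeness stated above is easy to check mechanically. The sketch below is a hypothetical helper (`is_complete` is not part of any package mentioned in this chapter). It tests that each row of G sums to 1 and that every node can be reached from every other node along edges with positive weight. The second matrix adds edges with tail on H3, using weights 1/2 and 1/2 that are an arbitrary choice made for this illustration.

```python
def is_complete(G, tol=1e-12):
    """Check the sufficient condition for completeness: outgoing weights
    sum to 1 at each node and every node is reachable from any other."""
    m = len(G)
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in G)

    def reachable(start):
        # Depth-first search along edges with positive weight.
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for k in range(m):
                if G[i][k] > 0 and k not in seen:
                    seen.add(k)
                    stack.append(k)
        return seen

    connected = all(len(reachable(i)) == m for i in range(m))
    return rows_ok and connected


# Graph of Figure 14.3: the third row (H3) has only zero entries.
G_fig_14_3 = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.0, 0.0, 0.0]]
# Same graph with edges added from H3 (weights chosen for illustration).
G_with_edges_from_H3 = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]

print(is_complete(G_fig_14_3))            # False
print(is_complete(G_with_edges_from_H3))  # True
```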

14.3.1.3 Adjusted p-Values

Adjusted p-values are often used to describe the results of a multiple test procedure. Adjusted p-values inherently incorporate the structure of the underlying multiple test procedure, and once they are computed, they can be compared directly with the overall significance level α. More formally, an adjusted p-value is the smallest significance level at which a given hypothesis is significant as part of the multiple test procedure (Westfall and Young, 1993).

In the following, we show that a slight modification of Algorithm 14.1 allows the calculation of m adjusted p-values $p_1^{\mathrm{adj}}, \ldots, p_m^{\mathrm{adj}}$ (Bretz et al., 2009). To this end, we assume for each J ⊆ I = {1, . . . , m} a collection of weights wj(J) such that 0 ≤ wj(J) ≤ 1 and $\sum_{j \in J} w_j(J) \le 1$. Setting w(I) = (w1(I), . . . , wm(I)) = (α1/α, . . . , αm/α), we obtain the same initial local significance levels when multiplying the weights with α, as used in Algorithm 14.1. The remaining weights wj(J), J ⊊ I, are obtained by updating the graph iteratively in the same way as before.

Algorithm 14.2 (Adjusted p-values)

0. Set I = {1, 2, . . . , m} and pmax = 0.

1. Let $j = \arg\min_{i \in I} p_i / w_i(I)$, calculate $p_j^{\mathrm{adj}} = \max\{p_j / w_j(I),\ p_{\max}\}$, and set $p_{\max} = p_j^{\mathrm{adj}}$.

2. Update the graph:

$$
I \to I \setminus \{j\}
$$

$$
w_\ell(I) \to \begin{cases} w_\ell(I) + w_j(I)\, g_{j\ell}, & \text{for } \ell \in I, \\ 0, & \text{otherwise,} \end{cases}
$$

$$
g_{\ell k} \to \begin{cases} \dfrac{g_{\ell k} + g_{\ell j}\, g_{jk}}{1 - g_{\ell j}\, g_{j\ell}}, & \text{for } \ell, k \in I,\ \ell \ne k,\ g_{\ell j}\, g_{j\ell} < 1, \\ 0, & \text{otherwise.} \end{cases}
$$

3. If |I| ≥ 1, go to Step 1; otherwise stop.

4. Reject all hypotheses Hj with $p_j^{\mathrm{adj}} \le \alpha$.

To illustrate Algorithm 14.2, we revisit the numerical example from Figure 14.3. Here, w(I) = (1/2, 1/2, 0). At the first iteration, j = 2 and $p_2^{\mathrm{adj}}$ = max{0.001/0.5, 0} = 0.002. After updating the graph, we obtain at the second iteration j = 1 with $p_1^{\mathrm{adj}}$ = max{0.015/0.75, 0.002} = 0.02. Finally, $p_3^{\mathrm{adj}}$ = max{0.1, 0.02} = 0.1. Thus, we can reject H1 for any significance level α ≥ 0.02, reject H2 for any α ≥ 0.002, and reject H3 for any α ≥ 0.1. It should be noted that the test decisions obtained from Algorithm 14.2 are exactly the same as those from Algorithm 14.1 for a fixed α.
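The steps of Algorithm 14.2 translate almost line by line into code. The Python sketch below is only meant as an illustration under stated assumptions: the function name `adjusted_pvalues` is hypothetical, a weight of zero is treated as giving an infinite ratio p/w, and, as in the algorithm as stated, the results are not truncated at 1. The input again encodes Figure 14.3.

```python
import math


def adjusted_pvalues(w, G, p):
    """Adjusted p-values for a Bonferroni-based graphical test
    (sketch of Algorithm 14.2).

    w: initial weights w_i(I) = alpha_i / alpha
    G: transition matrix, G[i][k] = weight of the edge H_i -> H_k
    p: unadjusted p-values
    """
    m = len(p)
    w = list(w)
    g = [list(row) for row in G]
    active = set(range(m))          # Step 0: I = {1, ..., m}
    p_adj = [None] * m
    p_max = 0.0

    def ratio(i):
        # p_i / w_i(I); infinite while H_i carries no weight
        return p[i] / w[i] if w[i] > 0 else math.inf

    while active:                   # Step 3: repeat while |I| >= 1
        # Step 1: hypothesis with the smallest weighted p-value.
        j = min(sorted(active), key=ratio)
        p_adj[j] = max(ratio(j), p_max)
        p_max = p_adj[j]
        # Step 2: remove H_j, propagate its weight, update the edges.
        active.remove(j)
        for l in active:
            w[l] += w[j] * g[j][l]
        new_g = [[0.0] * m for _ in range(m)]
        for l in active:
            denom = 1.0 - g[l][j] * g[j][l]
            for k in active:
                if l != k and denom > 0:
                    new_g[l][k] = (g[l][k] + g[l][j] * g[j][k]) / denom
        g = new_g
    return p_adj


p_adj = adjusted_pvalues(
    [0.5, 0.5, 0.0],
    [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.0, 0.0, 0.0]],
    [0.015, 0.001, 0.1],
)
print(p_adj)  # approximately [0.02, 0.002, 0.1]
```

Step 4 then amounts to comparing each entry of `p_adj` with α.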

14.3.1.4 Simultaneous Confidence Intervals

Algorithm 14.1 can also be used to construct compatible simultaneous confidence intervals (Guilbaud, 2008; Strassburger and Bretz, 2008). Consider the one-sided null hypotheses Hi : θi ≤ δi, i ∈ I = {1, . . . , m}, where θi are the parameters of interest (e.g., treatment means or contrasts thereof) and δi are prespecified constants (e.g., noninferiority margins). Let αj(J) = α wj(J) denote local significance levels with j ∈ J ⊆ I. Further, let Li(γ) denote local (i.e., marginal) lower confidence bounds for θi at level 1 − γ for i ∈ I. Finally, let R denote the index set of hypotheses rejected by a multiple test procedure specified through a graph (α, G).

Following Strassburger and Bretz (2008), lower one-sided confidence bounds for θ1, . . . , θm with simultaneous coverage probability of at least 1 − α are given by

$$
L_i = \begin{cases} \delta_i, & \text{for } i \in R \text{ and } R \ne I, \\ L_i(\alpha_i), & \text{for } i \notin R, \\ \max(\delta_i, L_i(\alpha_i)), & \text{for } R = I, \end{cases}
$$

where αi = αi(I \ R) for i ∉ R ≠ I denotes the local significance level for Hi in the final graph when applying Algorithm 14.1. If all hypotheses can be rejected (i.e., R = I), the choice of the local levels αi = αi(∅) is free. Thus, in order to compute the simultaneous confidence bounds, one only needs to know the set R of rejected hypotheses and the corresponding local levels αi for all indices i of retained hypotheses. Note that if not all hypotheses are rejected, the confidence bounds associated with the rejected hypotheses reflect the test decision θi > δi and the confidence limits associated with the retained hypotheses are the marginal confidence limits at level αi(I \ R). In other words, unless R = I, the simultaneous confidence intervals for the rejected hypotheses do not provide any further information beyond the test decision, which limits their use in practice.

To illustrate the calculation of the simultaneous confidence intervals, we revisit the numerical example from Figure 14.3, assuming δi = 0 for all i. Recall from Section 14.3.1.2 that both H1 and H2 can be rejected at level α = 0.025. Thus, R = {1, 2} ⊊ I and L1 = L2 = 0. As seen from Figure 14.3, α3 = α and L3 reduces to the marginal lower confidence bound at level 1 − α.
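The three-case formula can be sketched in code once a form for the marginal bounds Li(γ) is fixed. The Python example below assumes, purely for illustration, normally distributed estimates with known standard errors, so that Li(γ) is the estimate minus the (1 − γ) normal quantile times the standard error. The function name and the standardized estimates (back-calculated from the p-values 0.015, 0.001, and 0.1 of the running example with standard error 1) are assumptions of this sketch, not quantities given in the chapter.

```python
from statistics import NormalDist


def simultaneous_lower_bounds(est, se, delta, rejected, local_alpha):
    """Lower confidence bounds following the three-case formula.

    est, se:     estimates and standard errors (normality assumed here)
    delta:       margins delta_i of the hypotheses H_i: theta_i <= delta_i
    rejected:    booleans, True if i is in the rejection set R
    local_alpha: local level alpha_i for each retained hypothesis in the
                 final graph (freely chosen levels if all are rejected)
    """
    z = NormalDist()

    def marginal(i):
        gamma = local_alpha[i]
        if gamma <= 0:
            return float("-inf")   # no level left: uninformative bound
        return est[i] - z.inv_cdf(1.0 - gamma) * se[i]

    all_rejected = all(rejected)
    bounds = []
    for i in range(len(est)):
        if all_rejected:
            bounds.append(max(delta[i], marginal(i)))   # case R = I
        elif rejected[i]:
            bounds.append(delta[i])                     # i in R, R != I
        else:
            bounds.append(marginal(i))                  # i not in R
    return bounds


# Hypothetical standardized estimates matching p = (0.015, 0.001, 0.1).
p = [0.015, 0.001, 0.1]
est = [NormalDist().inv_cdf(1.0 - pi) for pi in p]
se = [1.0, 1.0, 1.0]
delta = [0.0, 0.0, 0.0]
rejected = [True, True, False]      # R = {1, 2}
local_alpha = [0.0, 0.0, 0.025]     # alpha_3 = alpha in the final graph

bounds = simultaneous_lower_bounds(est, se, delta, rejected, local_alpha)
print(bounds)  # first two bounds equal 0.0; the third is negative
```

The negative bound for θ3 is consistent with H3 being retained at level α.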

14.3.1.5 Power and Sample Size Calculation

Determining the sample size is an integral part of designing clinical studies. If a single null hypothesis is tested, for example, when comparing a new treatment against a control for a single primary endpoint in a two-armed trial, sample size is usually based on achieving a prespecified power for a specific parameter configuration under the alternative hypothesis. But assume in this example that in addition to the primary endpoint, there is interest in assessing the benefit of the new treatment for a single secondary endpoint. Should then the sample size be determined on achieving a prespecified power to declare both the primary and secondary endpoints significant or just the primary endpoint? The traditional power concept can be generalized in various ways when moving from single to multiple hypotheses test problems. Several authors have introduced a variety of power concepts related to different win criteria; see Maurer and Mellein (1988); Xiong et al. (2005); Senn and Bretz (2007); Sozu et al. (2010); Chen et al. (2011); Julious and McIntyre (2012), among many others. It is not always clear which of these criteria is best suited in practice. In this section, we provide some considerations for power and sample size calculation in clinical trials with multiple objectives that are divided into primary and secondary objectives.

With multiple primary and secondary objectives in a single trial, it becomes important to distinguish between the probability for a successful trial, as driven by the primary objectives, and the power to reject the individual null hypotheses. To reflect these two objectives, we propose to (1) first select a general test strategy addressing the study objectives specified in the protocol, and (2) subsequently fine-tune it based on the importance relationships among the primary and secondary hypotheses, as induced by the study objectives, and the prior assumptions about the effect sizes for all primary and secondary variables (Bretz et al., 2011a). Using a graphical approach, the initial significance levels αi and transition weights gij define the multiple test procedure and thus the sample size under a fixed parameter configuration. Clinical considerations will guide the discussions about the general test strategy and possibly support the weight specifications. Clinical trial simulations are then necessary to further fine-tune the weights, understand the operating characteristics of the resulting multiple test procedure, including its robustness properties against deviations from the initial assumptions, and based on this determine an appropriate sample size.

FIGURE 14.4 Graphical multiple test procedure from Figure 14.3 revisited. Nodes H1, H2, and H3 carry the initial levels α1, α2, and α − α1 − α2; the edges carry the weights g13, 1 − g13, g21, 1 − g21, g32, and 1 − g32.

To illustrate these concepts, consider again the example in Figure 14.3. Once the clinical team has agreed that H1, H2 are the two primary and H3 is the single secondary hypothesis, this leaves five weights to be specified: α1, α2 (which determine α3 = α − α1 − α2) and g13, g21, g32 (which determine g12 = 1 − g13, g23 = 1 − g21, g31 = 1 − g32, respectively); see Figure 14.4. It seems natural to declare this trial successful, if at least one of the two primary hypotheses H1 and H2 is rejected. Further significant results are nice to have but not essential for claiming trial success. Consequently, the relevant primary power measure should be the probability for a successful trial, that is, the probability of rejecting either H1 or H2 at their local significance levels α1 and α2, if they are in fact not true, which in turn implies setting α3 = 0 to maximize that probability.

However, there may be situations where success in H1 and H2 is not sufficient. For example, if H3 is critical to achieve an important label claim, a dream outcome of the trial would be to reject {H1 and H3} or {H2 and H3}. That is, the trial is considered truly successful if at least one of the two primary hypotheses H1 and H2 is rejected, followed by a rejection of H3. The sample size needed to achieve this dream outcome is obviously larger than achieving a success in one of the primary hypotheses alone. Thus, it is critical to understand from clinical team discussions what a successful trial truly means and formalize accordingly the objective function for the sample size calculation.

The gMCP package described in Section 14.3.3.2 offers a convenient interface to perform power calculations for any graphical test procedure. The user can enter any tailored power function using available buttons from the icon panel. For example, the expression

(x[1] || x[2]) && x[3]

calculates the probability that the first or (not exclusively) the second hypothesis is rejected, together with the third one, where x[i] specifies the proposition that hypothesis Hi is rejected. In addition, any valid R command can be used, such as any(x) to see whether any hypothesis is rejected or all(x[1:3]) to see whether all of the first three hypotheses are rejected.
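The idea of a user-defined objective function can be sketched outside the GUI with a small Monte Carlo simulation. The Python example below is independent of the gMCP implementation and rests on loud assumptions: the test statistics are independent unit-variance normals with hypothetical means, one-sided p-values are used, and the graph is the one from Figure 14.3. All function names are hypothetical. Because Python indexes from 0, the expression (x[1] || x[2]) && x[3] becomes `(x[0] or x[1]) and x[2]`.

```python
import random
from statistics import NormalDist


def reject_set(alpha, G, p):
    """Compact sketch of the sequentially rejective graphical test."""
    m = len(p)
    a, g = list(alpha), [list(row) for row in G]
    active, x = set(range(m)), [False] * m
    while True:
        j = next((i for i in sorted(active) if p[i] <= a[i]), None)
        if j is None:
            return x
        x[j] = True
        active.remove(j)
        for l in active:                      # propagate the level of H_j
            a[l] += a[j] * g[j][l]
        new_g = [[0.0] * m for _ in range(m)]
        for l in active:                      # reconnect remaining edges
            denom = 1.0 - g[l][j] * g[j][l]
            for k in active:
                if l != k and denom > 0:
                    new_g[l][k] = (g[l][k] + g[l][j] * g[j][k]) / denom
        g = new_g


def simulate_power(objective, alpha, G, means, n_sim=2000, seed=1):
    """Estimate the probability that objective(x) holds, where x[i] is
    True if hypothesis i (0-based) is rejected."""
    rng = random.Random(seed)
    std = NormalDist()
    hits = 0
    for _ in range(n_sim):
        # One-sided p-values from independent N(mean, 1) test statistics.
        p = [1.0 - std.cdf(rng.gauss(mu, 1.0)) for mu in means]
        hits += bool(objective(reject_set(alpha, G, p)))
    return hits / n_sim


alpha = 0.025
levels = [alpha / 2, alpha / 2, 0.0]
G = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.0, 0.0, 0.0]]
means = [3.0, 3.0, 2.0]   # hypothetical effect sizes on the z-scale

power_dream = simulate_power(lambda x: (x[0] or x[1]) and x[2], levels, G, means)
power_any = simulate_power(any, levels, G, means)
print(power_dream, power_any)
```

Since both calls use the same seed, they evaluate the two objective functions on identical simulated trials, so the estimate for the dream outcome can never exceed the estimate for rejecting any hypothesis.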

14.3.2 Case Studies Revisited

In this section, we revisit the case studies from Section 14.2 to illustrate the graphical approach described previously.

14.3.2.1 Comparing Two Doses with a Control for Two Hierarchical Endpoints

We follow the general outline proposed in Section 14.3.1.5 to first identify a suitable general test strategy addressing the study objectives and subsequently fine-tune it. Such an approach reduces the problem of specifying a suitable multiple test procedure to the determination of a graph (α, G), accounting for the importance relationships among the primary and secondary hypotheses. To start with, we consider the initial levels αi. In order to reflect the hierarchy between the two endpoints within a given dose, we assign weight 0 to the secondary hypotheses and split the significance level α equally across both doses, since both doses are considered equally important in this case study. Therefore, α3 = α4 = 0 and α1 = α2 = α/2.

Next, we determine the initial transition weights gij. There are in total 12 possible edges to connect any two nodes in order to specify how the significance levels are propagated after a hypothesis has been rejected. However, this number of edges can be reduced substantially by taking clinical considerations into account. In our example, the edges H3 → H1 and H4 → H2 receive weight 0 because of the hierarchy among the two endpoints within a given dose. In addition, if successiveness is required (Section 14.2.1), there are no edges H1 → H4, H2 → H3, H3 → H4, and H4 → H3, as otherwise one can always construct examples where for a given dose, the secondary hypothesis is rejected, but the associated primary hypothesis is not. This leaves us with the six edges displayed in the left graph of Figure 14.5. As the sum of the weights over all outgoing edges for a given node should not be greater than 1, this gives g41 = g32 = 1 and we are left with two remaining weights g12, g21 to be determined. Their choice can be based on different considerations.

FIGURE 14.5 Graphical visualization of a viable multiple test procedure for the diabetes case study, with columns for the low and high dose and rows for HbA1c (H1, H2) and body weight (H3, H4). (a) Graph with the weights g12 and g21 left unspecified. (b) Resulting graph for the case g12 = g21 = 0.

If, for example, safety is a major concern, one might prefer testing the primary endpoint for the low dose, in case the high dose is significant but not safe (leading to a large value of g21). Otherwise, if safety is not of major concern, one might prefer giving the secondary endpoints more weight instead of propagating a large fraction of the significance level to the other primary hypothesis, thus leading to small values for g12 and g21. The right graph in Figure 14.5 displays the resulting graph for the extreme case g12 = g21 = 0. If both the primary and secondary endpoints for a given dose are rejected at α/2, then the other dose can be tested at level α. This can be interpreted as a Holm type procedure applied to the families of hypotheses per dose, {H1, H3} and {H2, H4}. Finally, if no preference for the choice of g12 and g21 is at hand, based on the available clinical considerations, numerical optimization can be used to determine their values in order to maximize the power of the multiple test procedure (Bretz et al., 2011a), although in many cases, the power does not depend dramatically on the selected weights (Wiens et al., 2013).
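The considerations above can be collected into the pair (α, G) that defines the procedure. Ordering the hypotheses as H1, . . . , H4 and leaving g12 and g21 free, a sketch in the notation of this chapter is:

```latex
\boldsymbol{\alpha} = \left(\tfrac{\alpha}{2},\ \tfrac{\alpha}{2},\ 0,\ 0\right),
\qquad
G = \begin{pmatrix}
0      & g_{12} & 1 - g_{12} & 0          \\
g_{21} & 0      & 0          & 1 - g_{21} \\
0      & 1      & 0          & 0          \\
1      & 0      & 0          & 0
\end{pmatrix},
\qquad 0 \le g_{12},\, g_{21} \le 1 .
```

Every row of G sums to 1, so no significance level is lost when a hypothesis is rejected. Setting g12 = g21 = 0 leaves only the cycle H1 → H3 → H2 → H4 → H1, which is the extreme case described in the text.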

14.3.2.2 Test Strategies for Composite Endpoints and Their Components

A simple way of decomposing a composite endpoint is to prioritize its components and employ a hierarchical test procedure. As in Section 14.2.2, let H1 denote the composite endpoint hypothesis and H2 the key individual component hypothesis (e.g., for CV death). Accordingly, we test H1 at level α and only if this is rejected, we proceed with testing H2, also at level α; see Figure 14.6a. However, a hierarchical test procedure might not always be appropriate, as failure to reject the hypothesis of the composite endpoint H1 prohibits testing the individual endpoint hypothesis H2, even if the associated p-value p2 is very small. One might instead prefer α2 > 0 so that H2 can be tested even if H1 is not rejected. Such an approach in turn comes only at the cost of reduced power for the composite endpoint, since H1 will have to be tested at the smaller level α1 = α − α2, although the overall power to reject at least one of the two hypotheses increases if α2 > 0, in particular if the correlation between the respective test statistics is small. Figure 14.6b displays the graphical visualization of the resulting test procedure, which turns out to be a weighted version of the Holm procedure from Figure 14.2. Note that applying such a strategy may lead to difficulties in interpretation if the critical component trends in the wrong direction. In such cases, one may consider introducing a prespecified consistency criterion to ensure clinically meaningful results (Huque et al., 2011). Similarly, Alosh et al. (2014) considered multiple testing strategies that allow testing H2 as long as the result of testing H1 establishes a prespecified minimum level of efficacy. Furthermore, the significance level for testing the mortality hypothesis H2 can be adapted to the findings of testing the composite endpoint hypothesis H1. Extensions to more complex testing strategies involving multiple components are also possible (Rauch and Beyersmann, 2013).

FIGURE 14.6 Test strategies for composite endpoints and their components (a–c details are given in the text).

One question that remains for future discussion is on the rationale of using composite endpoints. Instead of using an aggregate measure that needs to be decomposed in many cases anyway, one may wonder whether the individual components can be treated as multiple primary endpoints (and thus replace the composite endpoint), where study success is defined as at least one of the multiple components being significant. In such cases, the classical multiplicity problem is at hand, since there are multiple chances of winning on at least one endpoint. But having clearly separated endpoints and decision criteria (i.e., win scenarios) might outweigh the disadvantage of having to adjust the endpoints for multiplicity.

More recently, recurrent event data analyses have attracted interest as an alternative to traditional time-to-first-event analyses. For the sake of concreteness, assume two components, one being a recurrent event process (e.g., hospitalization) and the other one being a terminal event process (e.g., CV death). Testing these two components as a composite endpoint does not require multiplicity adjustment, and a possible label claim could be "drug X reduces the rate of primary composite endpoint events consisting of CV death and hospitalization." However, additional testing of the components is highly desirable, but requires multiplicity adjustment (in particular if a claim is sought despite a negative composite endpoint outcome). To this end, let H1 denote the composite endpoint hypothesis. Further, let H2 and H3 denote the two component hypotheses for hospitalization and CV death, respectively, which could be tested using estimates from a joint frailty model (Liu et al., 2004; Cowling et al., 2006). Using the graphical approach, suitable multiple test strategies can be considered and tailored to given clinical study objectives. Figure 14.6c displays a possible sequentially rejective graphical test procedure, which splits the initial significance level of α = 0.025 (say) unequally across H1 and H2 and tests them using essentially a weighted Holm test. Moreover, CV death is tested only if both the composite and hospitalization endpoints are rejected, which is a consequence of the infinitesimal weight g13 = ε chosen for the edge H1 → H3 (see Section 14.3.4 for a more formal introduction of such weights). Alternatively, one could choose a truly positive weight g13 in order to avoid that a strong effect in CV death is diluted by a lower effect in hospitalization, in which case there would be an increased chance of a nonsignificant composite effect.

14.3.2.3 Testing Noninferiority and Superiority for Multiple Endpoints in a Combination Trial

Recall from Section 14.2.3 that we are interested in comparing the aliskiren/enalapril combination therapy C and aliskiren monotherapy A with the enalapril monotherapy E, resulting in several superiority and noninferiority assessments for the single primary and multiple secondary endpoints. More specifically, we have three single null hypotheses and two subfamilies of null hypotheses for which an appropriate multiple test procedure has to be constructed:

H1 : superiority of C versus E
H2 : noninferiority of A versus E
H3 : superiority of A versus E
H4 : multiple secondary variables for C versus E
H5 : multiple secondary variables for A versus E

FIGURE 14.7 Graphical illustration of the test strategy for noninferiority and superiority in a combination trial with multiple endpoints.

Figure 14.7 visualizes the multiple test procedure resulting from a series of interactive discussions with the clinical team on how to structure the clinically relevant hierarchies reflecting the study objectives. Note that H4 and H5 are subfamilies of multiple secondary hypotheses. Appropriate multiple test procedures have to be constructed for each of these two subfamilies but are not discussed here for the sake of brevity.

Some of the considerations leading to Figure 14.7 are as follows. None of the secondary hypotheses in H4 and H5 are initially assigned a positive significance level. Any secondary hypothesis can only be tested if at least the associated primary objective has been achieved before. The graph in Figure 14.7 reflects the natural parent–descendant hierarchy within each treatment C and A: H4 can only be tested if H1 was significant, and similarly, H3 and H5 can only be tested if H2 was significant. The actual significance levels depend on the test results for the entire family of hypotheses. In particular, the superiority hypothesis H3 can only be tested if the associated noninferiority hypothesis H2 was rejected before (any other decision strategy would probably be illogical). Finally, note that if all individual null hypotheses in H4 or H5 are rejected, the local significance level is propagated to the hypotheses sequence for the other treatment. If needed, this propagation rule could be modified by displaying the individual secondary hypotheses and specifying propagation rules for each of the nodes. Alternative approaches to propagate significance levels between families of hypotheses are discussed in Section 14.4.4.

14.3.3 Software Implementations

In this section, we review SAS and R implementations of the graphical approach. We revisit some of the previous examples and provide relevant example calls.

14.3.3.1 SAS

SAS/IML functions are available for both Algorithms 14.1 and 14.2; see Bretz et al. (2011a) and Alosh et al. (2014), respectively. Both functions are straightforward to use and can be applied to any graphical test procedure introduced in Section 14.3.1. To illustrate their functionality, we revisit the diabetes case study from Section 14.2.1. More specifically, we consider the graph in Figure 14.5 with g12 = g21 = 1/2. We let α = 0.025 and assume the unadjusted p-values p1 = 0.1, p2 = 0.001, p3 = 0.0001, and p4 = 0.005, which could be obtained from standard statistical procedures like PROC GLM or PROC MIXED.

In order to execute the mcp function from Bretz et al. (2011a) for Algorithm 14.1, we specify

h = {0 0 0 0};
a = {0.0125 0.0125 0 0};
w = {0   0.5 0.5 0  ,
     0.5 0   0   0.5,
     0   1   0   0  ,
     1   0   0   0  };
p = {0.1 0.001 0.0001 0.005};

where h is a 1 × m vector indicating whether a hypothesis is rejected (=1) or not (=0), a is a 1 × m vector with the initial significance level allocation α, w is the m × m matrix G with the transition weights, and p is a 1 × m vector with the unadjusted p-values. Calling

run mcp(h, a, w, p);

we then conclude from the output

h

0 1 0 1

that we can reject H2 and H4; see Figure 14.8 for the iterated graphs. Note that H3 cannot be rejected despite its very small p-value, which is consistent with the successiveness requirement stated in Section 14.2.1.

FIGURE 14.8 Numerical example for the diabetes case study with unadjusted p-values p1 = 0.1, p2 = 0.001, p3 = 0.0001, and p4 = 0.005. Panels: initial graph; graph after H2 is rejected; graph after H2 and H4 are rejected.

The output of the mcp function also includes the updated significance levels after each iteration as well as the transition matrix after the last iteration, but we omit it here for brevity. Finally, we note that the modified mcp function from Alosh et al. (2014) for Algorithm 14.2 gives the adjusted p-values $p_1^{\mathrm{adj}}$ = 0.1, $p_2^{\mathrm{adj}}$ = 0.002, $p_3^{\mathrm{adj}}$ = 0.1, and $p_4^{\mathrm{adj}}$ = 0.02, leading to the same test decisions for α = 0.025.
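These adjusted p-values can be checked by hand with Algorithm 14.2. The trace below is a worked sketch for the graph of Figure 14.5 with g12 = g21 = 1/2; each row shows the current weights of the remaining hypotheses (a dot marks a hypothesis that has already been removed), the selected index j, and the resulting adjusted p-value.

```latex
\begin{align*}
w &= \left(\tfrac12,\ \tfrac12,\ 0,\ 0\right): & j &= 2, &
  p_2^{\mathrm{adj}} &= \max\{0.001/0.5,\ 0\} = 0.002, \\
w &= \left(\tfrac34,\ \cdot,\ 0,\ \tfrac14\right): & j &= 4, &
  p_4^{\mathrm{adj}} &= \max\{0.005/0.25,\ 0.002\} = 0.02, \\
w &= \left(1,\ \cdot,\ 0,\ \cdot\right): & j &= 1, &
  p_1^{\mathrm{adj}} &= \max\{0.1/1,\ 0.02\} = 0.1, \\
w &= \left(\cdot,\ \cdot,\ 1,\ \cdot\right): & j &= 3, &
  p_3^{\mathrm{adj}} &= \max\{0.0001/1,\ 0.1\} = 0.1 .
\end{align*}
```

The weight updates use the transition weights after each removal: g41 stays at 1 once H2 is removed, and the edge H1 → H3 is updated from 1/2 to 2/3 and then to 1. H3 receives weight only after H1 has been removed, which is why its adjusted p-value equals that of H1 despite the very small unadjusted value p3 = 0.0001.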

14.3.3.2 R

The gMCP package (Rohmeyer and Klinglmueller, 2014) in R offers a GUI to conveniently construct and perform graphical multiple comparison procedures. The latest version of gMCP is available at CRAN and can be downloaded from http://cran.r-project.org/package=gMCP/; see also the installation instructions at http://cran.r-project.org/web/packages/gMCP/INSTALL.

One way of starting a session is to invoke in R the gMCP package and subsequently call the GUI with

> library(gMCP)
> graphGUI()

Different buttons are available in the icon panel of the GUI to create a new graph. The main functionality includes the possibility of adding new nodes as well as new edges connecting any two selected nodes. In many cases, the edges will have to be dragged manually in order to improve the readability of the graphs. The associated labels, weights, and significance levels can be edited directly in the graph. Alternatively, the numerical information can be entered into the transition matrix and other fields on the right-hand side of the GUI. Figure 14.9 provides a screenshot of the GUI from the gMCP package, displaying the graphical Bonferroni-based test procedure from Figure 14.6c.

FIGURE 14.9 Screenshot of the GUI from the gMCP package. Left: Display of the graphical Bonferroni-based test procedure from Figure 14.6c. Right: Transition matrix, initial weights, and unadjusted p-values.

The gMCP package offers, among other features, the following functionality:

• Create graphs with drag and drop or directly in R
• Perform graphical multiple test procedures based on Bonferroni, Simes, and parametric tests
• Compute adjusted p-values and simultaneous confidence intervals for Bonferroni-based graphical test procedures
• Perform power calculations based on user-defined objective functions
• Produce S4 objects for the graphs and the corresponding tests
• Export single graphs or produce full reports in LaTeX and PDF/PNG
• Browse through a large collection of example graphs from the literature

We refer to the accompanying vignettes for a complete description of the functionality. A brief illustration of the gMCP package with a cardiovascular clinical trial example is given in Bretz et al. (2011b).

14.3.4 Graphical Visualization of Common Multiple Test Procedures

In the previous sections, we already visualized several common multiple test procedures, such as the Bonferroni test in Figure 14.1, the ordinary Holm procedure in Figures 14.1 and 14.2, the weighted Holm procedure in Figure 14.6b, and the hierarchical test procedure in Figure 14.6a and c. In this section, we visualize further Bonferroni-based multiple test procedures from the literature. Other procedures, such as the truncated Holm procedure and k-out-of-n gatekeeping, will be visualized in Section 14.4 after a suitable extension of the graphical approach described so far. While it becomes transparent that many common multiple test procedures can be displayed using the graphical approach, its main advantage is the flexibility to construct and visualize tailored test strategies to address advanced multiplicity issues in clinical trials. This degree of flexibility is demonstrated with the case studies in Section 14.2, where novel test procedures had to be derived to meet the complex clinical trial objectives.

Figure 14.10a displays the fixed sequence test procedure (Westfall and Krishen, 2001) for m = 3 hypotheses with α1 = α and α2 = α3 = 0. The first hypothesis H1 is tested at level α. If rejected, its level is propagated to the second hypothesis H2, and so on. The fixed sequence test procedure controls the FWER in the strong sense and is often used in practice because of its simplicity. However, once a hypothesis is not rejected, no further testing is permitted and care has to be taken when specifying the testing sequence prior to a study.

FIGURE 14.10 Visualization of the (a) hierarchical and (b) fallback procedures for m = 3 hypotheses. Upper graph in (b): Fallback procedure with local levels αi, i = 1, 2, 3. Lower graph in (b): Updated graph after rejecting H2, with αi = α/3.

The fallback procedure alleviates these concerns. It reserves some fraction of the significance level α for the later hypotheses in the sequence and thus allows one to test those even if the initial hypotheses in the sequence are not rejected; see the upper graph of Figure 14.10b for a visualization, where α1 + α2 + α3 ≤ α. For illustration, assume equal local significance levels α1 = α2 = α3 = α/3 and the three unadjusted p-values p1 = 0.015, p2 = 0.001, and p3 = 0.02. If we set α = 0.025, then p2 = 0.001 < 0.0083 = α2. Accordingly, we can reject H2 and propagate its significance level to H3; see the lower graph in Figure 14.10b. The procedure stops at this stage, since no further hypothesis can be rejected. Alternatively, one can compute the adjusted p-values $p_1^{\mathrm{adj}} = 0.045$, $p_2^{\mathrm{adj}} = 0.003$, and $p_3^{\mathrm{adj}} = 0.03$ using Algorithm 14.2, which lead to the same test decisions.

Note that in Figure 14.10b, the local level α1 remains unchanged, even if we had rejected both H2 and H3. This is because after rejecting H3, its local level is not further propagated. Similar to the test procedure from Figure 14.3, the original fallback procedure is not complete and can be improved by adding one or more edges with tail on the last hypothesis in the sequence. Two such improvements are displayed in Figure 14.11. In the upper graph (a), the local significance level α3 is propagated along the two edges pointing back to H1 and H2, where γ = α2/(α1 + α2). The resulting test procedure is equivalent to the α-exhaustive extension of the fallback procedure introduced in Wiens and Dmitrienko (2005). Revisiting the numerical example from Figure 14.10b, the adjusted p-values become $p_1^{\mathrm{adj}} = 0.03$, $p_2^{\mathrm{adj}} = 0.003$, and $p_3^{\mathrm{adj}} = 0.03$, and one observes that $p_1^{\mathrm{adj}}$ is now smaller than for the original fallback procedure.
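The adjusted p-values quoted above can be reproduced with a short Python sketch. It assumes the usual form of the adjusted p-value algorithm (repeatedly pick the hypothesis with the smallest ratio of p-value to weight, record the running maximum, then update the graph); Algorithm 14.2 itself is not reproduced in this section, so the function names `update_graph` and `adjusted_p_values` and the matrix encoding of the graphs are illustrative assumptions rather than the authors' implementation.

```python
import math

def update_graph(w, G, j, active):
    """Remove node j: pass its weight along its outgoing edges and rewire the rest."""
    rest = [i for i in active if i != j]
    w_new = [0.0] * len(w)
    G_new = [[0.0] * len(w) for _ in w]
    for l in rest:
        w_new[l] = w[l] + w[j] * G[j][l]
        denom = 1.0 - G[l][j] * G[j][l]
        for k in rest:
            if k != l and denom > 0.0:
                G_new[l][k] = (G[l][k] + G[l][j] * G[j][k]) / denom
    return w_new, G_new, rest

def adjusted_p_values(p, w, G):
    """Adjusted p-values for a Bonferroni-based graphical procedure (sketch)."""
    active = list(range(len(p)))
    p_adj = [None] * len(p)
    running_max = 0.0
    while active:
        # Hypotheses with weight 0 cannot be rejected at this step.
        ratios = {i: (p[i] / w[i] if w[i] > 0 else math.inf) for i in active}
        j = min(active, key=ratios.get)
        running_max = max(running_max, ratios[j])
        p_adj[j] = min(1.0, running_max)
        w, G, active = update_graph(w, G, j, active)
    return p_adj

p = [0.015, 0.001, 0.02]
w0 = [1 / 3, 1 / 3, 1 / 3]
# Original fallback: H1 -> H2 -> H3, no edge leaving H3.
fallback = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
# Extension (a): H3 -> H1 with weight 1 - gamma, H3 -> H2 with weight gamma = 1/2.
extension_a = [[0, 1, 0], [0, 0, 1], [0.5, 0.5, 0]]

print([round(x, 4) for x in adjusted_p_values(p, w0, fallback)])
print([round(x, 4) for x in adjusted_p_values(p, w0, extension_a)])
```

Under these assumptions the two calls return the values given in the text for the original fallback procedure and for extension (a), up to floating-point rounding.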

FIGURE 14.11 Two extensions of the original fallback procedure for m = 3 hypotheses (a, b: details are given in the text). Both panels show the nodes H1, H2, H3 with local levels α1, α2, α3; the edge weights are 1, 1, 1 − γ, γ in (a) and 1, 1, 1 − ε, ε in (b).

Figure 14.11b displays a second extension by propagating the significance level to the first hypothesis in the hierarchy that has not been rejected so far (Hommel and Bretz, 2008). Here, ε denotes an infinitesimally small weight, indicating that the significance level is propagated from H2 to H3 only if both H1 and H2 are rejected. The motivation for this extension is that H1 is deemed more important than H3 as it has been placed earlier in the test sequence. Thus, once H2 is rejected, its associated significance level is propagated first to H1 before H3 gets a second chance to be tested. More formally, when updating the transition weights $g_{ij}$ in the graph according to Algorithms 14.1 or 14.2, ε is treated as a variable representing some fixed positive real number. For the computation of the updated significance levels $\alpha_i$ (Algorithm 14.1) or weights $w_i$ (Algorithm 14.2), we let ε → 0. For all real numbers x, we further set $x + \varepsilon = x$, $x\varepsilon = 0$, $\varepsilon^0 = 1$, and for all nonnegative integers k, l,

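$$
\frac{\varepsilon^k}{\varepsilon^l} =
\begin{cases}
0, & \text{if } k > l,\\
1, & \text{if } k = l,\\
\infty, & \text{if } k < l;
\end{cases}
$$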

see Bretz et al. (2009) for an introduction of ε-edges for the purpose of propagating significance levels between families of hypotheses. With these formalities, we can continue using Algorithms 14.1 and 14.2 without any changes. In particular, we obtain the adjusted p-values $p_1^{\mathrm{adj}} = 0.0225$, $p_2^{\mathrm{adj}} = 0.003$, and $p_3^{\mathrm{adj}} = 0.0225$ for the previous numerical example and can reject all three null hypotheses at level α = 0.025.
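The following worked derivation sketches how these adjusted p-values arise. It assumes equal initial weights $w_1 = w_2 = w_3 = 1/3$ and the transition weights $g_{12} = 1$, $g_{21} = 1 - \varepsilon$, $g_{23} = \varepsilon$ described in the text. The target of the remaining edge of weight 1 leaving H3 does not matter here, because H3 is the last hypothesis removed. The update rules are those of the graphical algorithm, and the running maximum is the usual convention for adjusted p-values.

```latex
\begin{align*}
\text{Step 1 (remove } H_2\text{):}\quad
  & p_2 / w_2 = 0.001 \cdot 3 = 0.003
    \;\Rightarrow\; p_2^{\mathrm{adj}} = 0.003, \\
  & w_1 \to \tfrac13 + \tfrac13 (1 - \varepsilon) \to \tfrac23, \qquad
    w_3 \to \tfrac13 + \tfrac13\,\varepsilon \to \tfrac13
    \quad (\varepsilon \to 0), \\
  & g_{13} \to \frac{g_{13} + g_{12}\, g_{23}}{1 - g_{12}\, g_{21}}
    = \frac{0 + \varepsilon}{1 - (1 - \varepsilon)}
    = \frac{\varepsilon^1}{\varepsilon^1} = 1. \\[4pt]
\text{Step 2 (remove } H_1\text{):}\quad
  & p_1 / w_1 = 0.015 \cdot \tfrac32 = 0.0225
    \;<\; p_3 / w_3 = 0.02 \cdot 3 = 0.06
    \;\Rightarrow\; p_1^{\mathrm{adj}} = \max(0.0225,\, 0.003) = 0.0225, \\
  & w_3 \to \tfrac13 + \tfrac23 \cdot g_{13} = 1. \\[4pt]
\text{Step 3 (remove } H_3\text{):}\quad
  & p_3 / w_3 = 0.02
    \;\Rightarrow\; p_3^{\mathrm{adj}} = \max(0.02,\, 0.0225) = 0.0225.
\end{align*}
```

The rule $\varepsilon^k / \varepsilon^l = 1$ for $k = l$ is what turns the ε-edge from H2 to H3 into a full edge from H1 to H3 once H2 has been removed.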

In a similar way, several gatekeeper procedures can be constructed and visualized using the graphical approach. Applying a serial gatekeeper procedure, all null hypotheses of a family of hypotheses must be rejected before proceeding in the test sequence (Maurer et al., 1995; Bauer et al., 1998; Westfall and Krishen, 2001). Figure 14.6c visualizes an example of a serial gatekeeper procedure with two families F1 = {H1, H2} and F2 = {H3}, tested in this order, where all hypotheses of F1 must be rejected before proceeding with F2. Figure 14.10a visualizes another example of a serial gatekeeper procedure with F1 = {H1}, F2 = {H2}, and F3 = {H3}. In contrast, applying a parallel gatekeeper procedure, at least one null hypothesis of a family must be rejected in order to proceed to the next family (Dmitrienko et al., 2003). Consider as an example two families of hypotheses F1 = {H1, H2} and F2 = {H3, H4} such that the hypotheses in F2 are tested only if at least one of the hypotheses in F1 is rejected. Figure 14.12 displays the parallel gatekeeper procedure from Dmitrienko et al. (2003), which assigns equal levels α/2 to the two primary hypotheses H1, H2 and levels 0 to the secondary hypotheses H3, H4. If H1 or H2 is rejected, the corresponding local level α/2 is split in half and propagated to H3 and H4 as indicated by the directed edges with weights 1/2. If H3 (H4) is subsequently rejected at its local significance level (either α/2 or α/4), this level is propagated to H4 (H3) as indicated by the directed edges with weights 1. Note that the procedure in Figure 14.12 is neither complete nor successive. That is, it can be improved uniformly by adding directed edges from F2 back to F1 (Bretz et al., 2009), and it does not preserve potential parent–descendant relationships (i.e., the secondary hypotheses can be tested regardless of which primary hypothesis is rejected).

FIGURE 14.12 Graphical visualization of the parallel gatekeeper procedure from Dmitrienko et al. (2003) with two families F1 = {H1, H2} and F2 = {H3, H4}. The primary hypotheses H1, H2 have local levels α/2 and the secondary hypotheses H3, H4 have local levels 0; edges from the primary to the secondary hypotheses have weight 1/2, and the edges between H3 and H4 have weight 1.

14.3.5 Technical Background

We now provide some technical background for the main results in Section 14.3.1. For the interested reader, we provide references to the literature for further details.

Closed testing offers a general framework to construct powerful multiple test procedures for any finite set of null hypotheses $H_i$, $i \in I = \{1, \ldots, m\}$ (Marcus and Gabriel, 1976). Using closed testing, we consider the family

$$
\mathcal{H} = \Bigl\{ H_J = \bigcap_{i \in J} H_i : J \subseteq I,\ H_J \neq \emptyset \Bigr\}
$$

of all nonempty intersection hypotheses $H_J$. We further prespecify for each $H \in \mathcal{H}$ an α-level test. The resulting closed test procedure rejects $H \in \mathcal{H}$ if all nonempty intersection hypotheses $H' \subseteq H$ are rejected by their corresponding α-level tests. By construction, closed test procedures control the FWER in the strong sense at level α. In what follows, we assume the elementary hypotheses to satisfy the free combination condition, that is, for any subset $J \subseteq I$ the simultaneous truth of $H_i$, $i \in J$, and falsehood of the remaining hypotheses is possible (Holm, 1979). For related results under restricted combinations, where the previous condition does not hold, we refer to Brannath and Bretz (2010) and Maurer and Klinglmüller (2013).

In the following, we assume for each intersection hypothesis $H_J$ a collection of weights $w_i(J)$ such that $0 \le w_i(J) \le 1$ and $\sum_{i \in J} w_i(J) \le 1$ for $J \subseteq I$. These weights quantify the relative importance of the hypotheses $H_i$ included in the intersection $H_J$. Moreover, we test $H_J$ using a weighted Bonferroni test. That is, an intersection hypothesis $H_J$ is rejected if $p_i \le w_i(J)\alpha = \alpha_i$ for at least one $i \in J \subseteq I$. This defines the class $\mathcal{B}$ of all closed test procedures that use weighted Bonferroni tests for each intersection hypothesis.
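A weighted Bonferroni test of a single intersection hypothesis can be sketched in a few lines of Python. The function name `weighted_bonferroni_rejects` and the dictionary-based interface are illustrative choices, and the weights in the example are taken from the fallback example with equal local levels α/3 discussed earlier in the chapter.

```python
def weighted_bonferroni_rejects(p, w, alpha):
    """Return True if H_J is rejected, i.e. p_i <= w_i(J) * alpha for some i in J.

    p and w are dictionaries keyed by the indices in J.
    """
    if sum(w.values()) > 1.0 + 1e-12:
        raise ValueError("weights of an intersection hypothesis must sum to at most 1")
    return any(p[i] <= w[i] * alpha for i in w)

alpha = 0.025
p = {1: 0.015, 2: 0.001, 3: 0.02}

# Global intersection H_{1,2,3} with weights 1/3 each: rejected through p_2.
print(weighted_bonferroni_rejects(p, {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}, alpha))

# Intersection H_{1,3} with weights (1/3, 2/3): neither p_1 nor p_3 is small enough.
print(weighted_bonferroni_rejects({1: 0.015, 3: 0.02}, {1: 1 / 3, 3: 2 / 3}, alpha))
```

In a closed test procedure, H2 would be rejected only if every intersection hypothesis containing it is rejected by such a test.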

A closed test procedure is said to be consonant if the following condition is satisfied: If an intersection hypothesis $H_J$ is rejected, there is an index $i \in J$ such that the elementary hypothesis $H_i$ can be rejected as well (Gabriel, 1969). Consonance is a desirable property as it ensures the rejection of an elementary null hypothesis after rejecting the global null hypothesis $H_I$. In particular, consonance enables the construction of sequentially rejective shortcut procedures such that the elementary hypotheses H1, . . . , Hm are tested in m steps, instead of testing all $2^m - 1$ intersection hypotheses as usually required by closed testing.

Consider now the subclass $\mathcal{S} \subset \mathcal{B}$ of closed weighted Bonferroni tests satisfying the monotonicity condition

$$
w_j(J) \le w_j(J') \quad \text{for all } J' \subseteq J \subseteq I \text{ and } j \in J'. \tag{14.2}
$$

It can be shown that this condition ensures consonance and admits a shortcut procedure (Hommel et al., 2007). That is, any procedure in $\mathcal{S}$ can be performed using the following sequentially rejective algorithm (Algorithm 14.3).

Algorithm 14.3 (Shortcut procedures in $\mathcal{S}$)
0. Set $I = \{1, \ldots, m\}$.
1. Let $i = \arg\min_{i' \in I} p_{i'} / w_{i'}(I)$. If $p_i / w_i(I) \le \alpha$, reject $H_i$; otherwise stop.
2. $I \to I \setminus \{i\}$.
3. If $|I| \ge 1$, go to Step 1; otherwise stop.

What remains is to define a suitable collection of weights $w_i(J)$, $J \subseteq I$, that satisfies the monotonicity condition (14.2) and is tailored to given study objectives. To this end, it can be shown that any initial graph (α, G) applied to a given set of hypotheses (nodes) together with the updating rules in Step 2 of Algorithm 14.1 generates a unique set of local significance levels. These local significance levels define weighted Bonferroni tests for the corresponding intersection hypotheses satisfying the monotonicity condition (14.2). Applying the shortcut procedure from Algorithm 14.3 to these local significance levels and the updating rules is then equivalent to Algorithm 14.1. In other words, the graphical approach defines a class $\mathcal{G} \subset \mathcal{S}$ of sequentially rejective Bonferroni-based closed test procedures, where the vector α specifies a weighted Bonferroni test for the global intersection hypothesis $H_I$ and the transition matrix G the weighted Bonferroni tests for the (m − 1)-way intersection hypotheses $H_{I \setminus \{j\}} = \bigcap_{i \in I \setminus \{j\}} H_i$, $j = 1, \ldots, m$. Note that the graphical approach leads to a specification of $m^2$ weights. Since closed testing involves $2^m - 1$ intersection hypotheses, consonant Bonferroni-based closed test procedures can be constructed for m ≥ 4, which are not covered by the graphical approach proposed so far. In Section 14.4, we will extend the graphical approach accordingly to include further test procedures based on weighted Bonferroni and other types of intersection tests.

14.4 Extensions

In this section, we extend the core graphical approach from Section 14.3.1. More specifically, we describe graphical approaches for multiple test procedures using weighted Simes or parametric tests, provide extensions to group sequential trials, and describe entangled graphs that have properties not shared by the approaches considered so far.

14.4.1 Parametric Graphical Test Procedures

The description of the graphical approaches has so far focused on Bonferroni-based test procedures. Following Bretz et al. (2011b) and Millen and Dmitrienko (2011), we now discuss how a separation between the weighting strategy and the test procedure facilitates the application of a graphical approach beyond Bonferroni tests.

Graphical weighting strategies are conceptually similar to the graphs proposed in Section 14.3.1. They essentially summarize the complete set of weights for the underlying closed test procedure. Weighted multiple tests can then be applied to the intersection hypotheses $H_J$, $J \subseteq I = \{1, \ldots, m\}$, such as weighted Bonferroni tests (leading to the graphical test procedures in Section 14.3.1.2), weighted min-p tests accounting for the correlation between the test statistics (this section), or weighted Simes tests (Section 14.4.2). Weighting strategies are formally defined through the weights $w_i(I)$, $i \in I$, for the global null hypothesis $H_I$ and the transition matrix $G = (g_{ij})$, where the transition weights $g_{ij}$ satisfy the regularity conditions (14.1). We additionally need to determine how the graph is updated once a node is removed. This can be achieved by tailoring Algorithm 14.1 to the graphical weighting strategies as follows. For a given index set $J \subsetneq I$, let $J^c = I \setminus J$ denote the set of indices that are not contained in J. Then the following algorithm (Algorithm 14.4) determines the weights $w_j(J)$, $j \in J$. This algorithm has to be repeated for each $J \subseteq I$ to generate the complete set of weights for the underlying closed test procedure.

Similar to what has been stated in Section 14.3.5, the weights $w_j(J)$, $j \in J$, are uniquely determined and do not depend on the sequence in which hypotheses $H_j$, $j \in J^c$, are removed in Step 1 of Algorithm 14.4. We refer to Example 1 in Bretz et al. (2011b) for an illustration of Algorithm 14.4.


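Algorithm 14.4 (Weighting strategy)
1. Select $j \in J^c$ and remove $H_j$.
2. Update the graph:

$$
I \to I \setminus \{j\}, \qquad J^c \to J^c \setminus \{j\},
$$

$$
w_\ell(I) \to
\begin{cases}
w_\ell(I) + w_j(I)\, g_{j\ell}, & \text{for } \ell \in I,\\
0, & \text{otherwise},
\end{cases}
$$

$$
g_{\ell k} \to
\begin{cases}
\dfrac{g_{\ell k} + g_{\ell j}\, g_{jk}}{1 - g_{\ell j}\, g_{j\ell}}, & \text{for } \ell, k \in I,\ \ell \neq k,\ g_{\ell j}\, g_{j\ell} < 1,\\
0, & \text{otherwise}.
\end{cases}
$$

3. If $|J^c| \ge 1$, go to Step 1; otherwise $w_\ell(J) = w_\ell(I)$, $\ell \in J$, and stop.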
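As a minimal sketch of Algorithm 14.4, the following Python code computes the weights $w_j(J)$ for every nonempty $J \subseteq I$ and checks the monotonicity condition (14.2) numerically. The example graph is an assumption: it encodes the four-hypothesis weighting strategy discussed later in this section (initial weights 1/2, 1/2, 0, 0 and edges H1 → H3 → H2 → H4 → H1, all with weight 1), with hypotheses indexed from 0. The helper names are illustrative only.

```python
from itertools import combinations

def subset_weights(w, G, J):
    """Algorithm 14.4 sketch: remove every node outside J, one at a time."""
    m = len(w)
    w, G = list(w), [row[:] for row in G]
    active = list(range(m))
    for j in [i for i in range(m) if i not in J]:
        rest = [i for i in active if i != j]
        w_new = [0.0] * m
        G_new = [[0.0] * m for _ in range(m)]
        for l in rest:
            w_new[l] = w[l] + w[j] * G[j][l]
            denom = 1.0 - G[l][j] * G[j][l]
            for k in rest:
                if k != l and denom > 0.0:
                    G_new[l][k] = (G[l][k] + G[l][j] * G[j][k]) / denom
        w, G, active = w_new, G_new, rest
    return {i: w[i] for i in J}

def is_monotone(weights):
    """Check w_j(J) <= w_j(J') for all J' subset of J and j in J' (condition 14.2)."""
    return all(weights[J][j] <= weights[Jp][j] + 1e-12
               for J in weights for Jp in weights
               if set(Jp) <= set(J) for j in Jp)

# Indices 0..3 stand for H1..H4 (assumed graph, see lead-in).
w0 = [0.5, 0.5, 0.0, 0.0]
G0 = [[0, 0, 1, 0],
      [0, 0, 0, 1],
      [0, 1, 0, 0],
      [1, 0, 0, 0]]

subsets = [J for r in range(1, 5) for J in combinations(range(4), r)]
weights = {J: subset_weights(w0, G0, J) for J in subsets}

print(weights[(1, 3)])      # weights for the intersection of H2 and H4
print(is_monotone(weights))
```

The nodes outside J are removed in index order here; by the uniqueness result quoted above, any other removal order should give the same weights.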

Once a collection of weights has been determined using Algorithm 14.4, we can apply any suitable weighted multiple test to the intersection hypotheses $H_J$, $J \subseteq I$. For example, if for $H_J$ the joint distribution of the p-values $p_j$, $j \in J$, is known, a weighted min-p test can be defined (Westfall and Young, 1993; Westfall et al., 1998). This test rejects $H_J$ if there exists a $j \in J$ such that $p_j \le c_J w_j(J)\alpha$, where $c_J$ is the largest constant satisfying

$$
P_{H_J}\Bigl( \bigcup_{j \in J} \{ p_j \le c_J\, w_j(J)\, \alpha \} \Bigr) \le \alpha. \tag{14.3}
$$

If the p-values are continuously distributed, there is a $c_J$ such that the rejection probability is exactly α. Determination of $c_J$ requires knowledge of the joint null distribution of the p-values and computation of the corresponding multivariate cumulative distribution functions. If the test statistics are multivariate normal or t distributed under the null hypotheses, these probabilities can be calculated using, for example, the mvtnorm package in R (Genz and Bretz, 2009). Alternatively, resampling-based methods may be used to approximate the joint null distribution (Westfall and Young, 1993). If not all, but some of the multivariate distributions of the p-values are known, it is still possible to derive conservative upper bounds of the rejection probability (Bretz et al., 2011b).
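To make the role of $c_J$ in (14.3) concrete, the sketch below solves for it in the simplest nontrivial case: two one-sided z-tests with equal weights and a known positive correlation between the test statistics. It uses only the Python standard library rather than mvtnorm. The one-factor representation of the bivariate normal, Simpson's rule, and the bisection search are implementation choices of this sketch, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

N = NormalDist()

def prob_both_below(z, rho, steps=2000, lim=8.0):
    """P(Z1 <= z, Z2 <= z) for a standard bivariate normal with correlation rho in (0, 1).

    Uses Z_i = sqrt(rho) * U + sqrt(1 - rho) * E_i and integrates over U with Simpson's rule.
    """
    h = 2.0 * lim / steps
    total = 0.0
    for s in range(steps + 1):
        u = -lim + s * h
        coef = 1 if s in (0, steps) else (4 if s % 2 else 2)
        inner = N.cdf((z - math.sqrt(rho) * u) / math.sqrt(1.0 - rho))
        total += coef * N.pdf(u) * inner ** 2
    return total * h / 3.0

def c_constant(w, alpha, rho, tol=1e-8):
    """Largest c with P(p_1 <= c*w*alpha or p_2 <= c*w*alpha) <= alpha (bisection)."""
    def rejection_prob(c):
        z = N.inv_cdf(1.0 - c * w * alpha)   # critical value of each one-sided z-test
        return 1.0 - prob_both_below(z, rho)
    lo, hi = 1.0, 1.0 / w                    # c = 1 is Bonferroni; c = 1/w tests each at alpha
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rejection_prob(mid) <= alpha:
            lo = mid
        else:
            hi = mid
    return lo

c_pair = c_constant(w=0.5, alpha=0.025, rho=0.5)
print(round(c_pair, 3), round(c_pair * 0.5 * 0.025, 4))
```

With weights 1/2, α = 0.025 and correlation 0.5, the result should be close to the constant 1.0783 used in the example at the end of this section, so that each hypothesis is tested at roughly 0.0135 instead of the Bonferroni level 0.0125.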

It follows immediately from the monotonicity condition (14.2) that the weighted parametric approaches considered here are consonant if

$$
c_J\, w_j(J) \le c_{J'}\, w_j(J') \quad \text{for all } J' \subseteq J \subseteq I \text{ and } j \in J'. \tag{14.4}
$$

If this new monotonicity condition (14.4) is satisfied, a sequentially rejective test procedure similar to the Bonferroni-based graphical tests from Section 14.3.1.2 can be defined (Algorithm 14.5).


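Algorithm 14.5 (Weighted parametric tests)
0. Set $I = \{1, 2, \ldots, m\}$.
1. Choose the maximal constant $c_I$ satisfying (14.3). Select a $j \in I$ such that $p_j \le c_I w_j(I)\alpha$ and reject $H_j$; otherwise stop.
2. Update the graph:

$$
I \to I \setminus \{j\},
$$

$$
w_\ell(I) \to
\begin{cases}
w_\ell(I) + w_j(I)\, g_{j\ell}, & \text{for } \ell \in I,\\
0, & \text{otherwise},
\end{cases}
$$

$$
g_{\ell k} \to
\begin{cases}
\dfrac{g_{\ell k} + g_{\ell j}\, g_{jk}}{1 - g_{\ell j}\, g_{j\ell}}, & \text{for } \ell, k \in I,\ \ell \neq k,\ g_{\ell j}\, g_{j\ell} < 1,\\
0, & \text{otherwise}.
\end{cases}
$$

3. If $|I| \ge 1$, go to Step 1; otherwise stop.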

Note that the monotonicity condition (14.4) is often violated in practice when using weighted parametric tests. In such cases, Algorithm 14.5 no longer applies and one has to go through the entire closed test procedure. That is, the weighting strategies from Algorithm 14.4 remain applicable, but the connection to a corresponding sequentially rejective test procedure is lost. For a given weighting strategy, however, applying parametric tests exploiting the correlations between the test statistics is uniformly more powerful than the associated Bonferroni-based test procedures from Algorithm 14.1.

We conclude this section with an example to illustrate Algorithm 14.5. To this end, we consider a simplified version of the ATMOSPHERE study from Section 14.2.3. More specifically, we consider testing noninferiority and superiority for two doses. Assume that H1, H2 denote the two noninferiority hypotheses (say, for low and high dose against control) and H3, H4 the two superiority hypotheses (for the same two dose-control comparisons).

The left graph in Figure 14.13 visualizes one possible weighting strategy for this example. It is motivated by a strict hierarchy within dose: Superiority will only be assessed if noninferiority was shown previously for the same dose. If for one of the two doses efficacy can be shown for both noninferiority and superiority, the associated weight is propagated to the other dose. A related Bonferroni-based graphical test procedure was used in the diabetes case study in Section 14.3.2.1; see the right graph in Figure 14.5. Other Bonferroni-based graphical approaches for combined noninferiority and superiority testing were investigated by Hung and Wang (2010), Lawrence (2011), and Guilbaud (2011).

FIGURE 14.13 (a) Weighting strategy to test noninferiority and superiority for two doses. (b) Updated weighting strategy after rejecting H1. In (a), the noninferiority hypotheses H1 (low dose) and H2 (high dose) have weight 1/2, the superiority hypotheses H3 (low dose) and H4 (high dose) have weight 0, and all edges have weight 1.

In the following, we exploit the fact that the correlations between the four test statistics are known. Applying standard analysis-of-variance assumptions with a known common variance, the complete joint distribution is known and we can apply (14.3), where α = 0.025. Note that if $w_j(J) = 0$ for some $j \in J$, the joint distribution degenerates. In our example, it suffices to calculate bivariate or univariate probabilities, where the correlation is determined only by the relative group sample sizes. For simplicity, assume that the group sample sizes are equal. Then the correlation between the noninferiority and superiority tests within the same dose is 1; all other correlations are 0.5. Therefore, $c_J = 1.0783$ for J = {1, 2}, {1, 4}, {2, 3}, and {3, 4}; otherwise, $c_J = 1$. Since at most two hypotheses in any intersection can have weight 0.5, condition (14.4) is satisfied and we can apply Algorithm 14.5. This leads to a sequentially rejective multiple test procedure, where at each step either bivariate Dunnett z tests or individual z tests are used (Bretz et al., 2011b). This conclusion remains true if the common variance is unknown and Dunnett t tests or individual t tests are used. Note that similar multiple test procedures are immediately applicable to testing for a treatment effect at two different dose levels in an overall population and, if at least one dose is significant, continuing to test in a prespecified subpopulation. This could apply to testing, for example, in the global study population and a regional subpopulation or in the enrolled full population and a targeted genetic subpopulation (Bretz et al., 2011b).

To illustrate the procedure, assume the unadjusted p-values p1 = 0.01, p2 = 0.02, p3 = 0.005, and p4 = 0.5. Following Algorithm 14.5, we have $p_1 \le c_I w_1(I)\alpha = 0.0135$ and can reject H1. The update step then leads to the right graph in Figure 14.13. Next, p3 ≤ 0.0135 and we can reject H3. This leaves us with H2, H4 and the weights $w_2(\{2, 4\}) = 1$, $w_4(\{2, 4\}) = 0$. Therefore, H2 is now tested at full level α. Because p2 ≤ α, we reject H2 and the procedure stops since H4 cannot be rejected. These calculations can be reproduced with the gMCP package described in Section 14.3.3.2.
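The same sequence of decisions can be traced with a small Python sketch of Algorithm 14.5. This is not the gMCP implementation. Two simplifying assumptions are made: the graph is encoded with edges H1 → H3 → H2 → H4 → H1 (all of weight 1), and the constant c is not recomputed from (14.3) but set to the value 1.0783 quoted in the text whenever exactly two remaining hypotheses carry positive weight, and to 1 otherwise.

```python
C_PAIR = 1.0783  # constant quoted in the text for two tests with correlation 0.5

def update_graph(w, G, j, active):
    """Step 2 of Algorithm 14.5: remove node j and update weights and edges."""
    rest = [i for i in active if i != j]
    w_new = [0.0] * len(w)
    G_new = [[0.0] * len(w) for _ in w]
    for l in rest:
        w_new[l] = w[l] + w[j] * G[j][l]
        denom = 1.0 - G[l][j] * G[j][l]
        for k in rest:
            if k != l and denom > 0.0:
                G_new[l][k] = (G[l][k] + G[l][j] * G[j][k]) / denom
    return w_new, G_new, rest

def parametric_graph_test(p, w, G, alpha):
    """Return the indices of rejected hypotheses in order of rejection."""
    active = list(range(len(p)))
    rejected = []
    while active:
        positive = [i for i in active if w[i] > 0]
        c = C_PAIR if len(positive) == 2 else 1.0   # assumption, see lead-in
        hits = [i for i in active if p[i] <= c * w[i] * alpha]
        if not hits:
            break
        j = hits[0]
        rejected.append(j)
        w, G, active = update_graph(w, G, j, active)
    return rejected

# Indices 0..3 stand for H1..H4.
p = [0.01, 0.02, 0.005, 0.5]
w0 = [0.5, 0.5, 0.0, 0.0]
G0 = [[0, 0, 1, 0],
      [0, 0, 0, 1],
      [0, 1, 0, 0],
      [1, 0, 0, 0]]

order = parametric_graph_test(p, w0, G0, alpha=0.025)
print(["H%d" % (i + 1) for i in order])
```

Under these assumptions the sketch rejects H1, then H3, then H2, and retains H4, in line with the walk-through above.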


14.4.2 Simes-Based Graphical Test Procedures

Generalizations of the Bonferroni-based graphical test procedures from Section 14.3.1 also apply when the correlations between the test statistics are not exactly known, but certain restrictions on them are assumed. In this case, the Simes test is a popular choice. Here, we consider the weighted Simes test introduced by Benjamini and Hochberg (1997), which rejects $H_I$ if for some $j \in I$

$$
p_j \le \sum_{i \in I_j} \alpha_i = \alpha \sum_{i \in I_j} w_i, \quad \text{where } I_j = \{ k \in I;\ p_k \le p_j \}.
$$

This weighted Simes test reduces to the original (unweighted) Simes test (Simes, 1986) for $w_i = 1/m$, $i \in I$. The weighted Simes test is conservative if, for example, the test statistics follow a multivariate normal distribution with nonnegative correlations and the tests are one sided (Benjamini and Heller, 2007).
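The rejection rule of the weighted Simes test translates directly into code. The Python sketch below is illustrative only; the function name and the small two-hypothesis example are assumptions chosen to show a case where the weighted Simes test rejects although the weighted Bonferroni test with the same weights does not.

```python
def weighted_simes_rejects(p, w, alpha):
    """Reject if, for some j, p_j <= alpha * (sum of w_k over all k with p_k <= p_j)."""
    for j in p:
        cumulative_weight = sum(w[k] for k in p if p[k] <= p[j])
        if p[j] <= alpha * cumulative_weight:
            return True
    return False

def weighted_bonferroni_rejects(p, w, alpha):
    """Reject if p_j <= alpha * w_j for some j."""
    return any(p[j] <= alpha * w[j] for j in p)

alpha = 0.025
p = {1: 0.02, 2: 0.024}
w = {1: 0.5, 2: 0.5}

print(weighted_bonferroni_rejects(p, w, alpha))  # both p-values exceed 0.0125
print(weighted_simes_rejects(p, w, alpha))       # p_2 = 0.024 <= 0.025 * (0.5 + 0.5)
```

Any rejection by the weighted Bonferroni test is also a rejection by the weighted Simes test, because the cumulative weight for index j always includes $w_j$ itself.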

Applying the closure principle, the resulting multiple test procedure rejects $H_i$, $i \in I$, at level α if for each $J \subseteq I$ with $i \in J$, there exists an index $j \in J$ such that

$$
p_j \le \alpha \sum_{k \in J_j} w_k(J), \tag{14.5}
$$

where $J_j = \{ k \in J;\ p_k \le p_j \}$ (Bretz et al., 2011b). If all weights are equal, this reduces to the Hommel procedure (Hommel, 1988). Although full consonance is generally not available for Simes-based closed test procedures, we can derive a partially sequentially rejective test procedure, which leads to the same test decision as the closed test procedure defined previously. In the following, we assume that the weights are exhaustive, that is, $\sum_{k \in J} w_k(J) = 1$ for all subsets $J \subseteq I$.

Algorithm 14.6 (Weighted Simes tests)
1. If $p_i > \alpha$ for all $i \in I$, stop and retain all m hypotheses.
2. If $p_i \le \alpha$ for all $i \in I$, stop and reject all m hypotheses.
3. Perform the Bonferroni-based graphical test procedure from Section 14.3.1. Let $I_r$ denote the index set of rejected hypotheses and $I_r^c$ its complement in I. If $|I_r^c| < 3$, stop and retain the remaining hypotheses.
4. If $|I_r^c| \ge 3$, consider the weights $w_i(I_r^c)$, $i \in I_r^c$, and the transition matrix G defined on $I_r^c$ as the new initial graph for the remaining hypotheses. Compute the weights $w_k(J)$ for all $J \subseteq I_r^c$ with Algorithm 14.4.
5. Reject $H_i$, $i \in I_r^c$, if for each $J \subseteq I_r^c$ with $i \in J$ there exists an index $j \in J$ such that $p_j \le \alpha \sum_{k \in J_j} w_k(J)$.

Algorithm 14.6 first considers those outcomes that are easy to verify (Steps 1 and 2) or where sequential rejection of the hypotheses is possible (Step 3). Only then does one need to compute the weights for all remaining hypotheses and their subsets and apply the closed weighted Simes procedure. It can happen though that no hypotheses can be rejected in Steps 2 and 3 and that one has to perform Step 4 with the full set of all m hypotheses. Note that for a given weighting strategy, the Simes-based graphical test procedure is uniformly more powerful than an associated Bonferroni-based procedure from Section 14.3.1. A numerical example applying Algorithm 14.6 is given in Bretz et al. (2011b).

14.4.3 Graphical Approaches for Group Sequential Designs

We now consider the general situation of testing multiple hypotheses repeatedly in time. More specifically, we extend the scope of the graphical approach to group sequential designs with one or more interim analyses and one final analysis. Under mild monotonicity conditions on the error spending functions, this allows the use of graphical test procedures in group sequential trials in a similar way as described so far.

To this end, we first consider testing a single null hypothesis H : θ ≤ 0 at h − 1 interim analyses and one final analysis. Following the standard approaches for group sequential designs (Whitehead, 1997; Jennison and Turnbull, 2000; Proschan et al., 2006; Emerson, 2007), we assume (asymptotically) multivariate normal statistics $Z_t$, $t = 1, \ldots, h$, with $E(Z_t) = \theta \sqrt{I_t}$ and $\mathrm{Cov}(Z_t, Z_{t'}) = \sqrt{I_t / I_{t'}}$, $t \le t'$. Here, $I_t$ denotes the information available at time point t, which is often proportional to the number of patients available up to t and inversely proportional to the variance of the underlying measure of effect. We consider spending functions $a(\gamma, y)$ with information fraction y and significance level 0 < γ < 1 such that $a(\gamma, 0) = 0$, $a(\gamma, 1) = \gamma$, and $a(\gamma, y) \le a(\gamma, y')$ for $0 \le y < y' \le 1$. For a given time point t, $y_t = I_t / I_{\max}$ and using nominal p-values $p_t$, we calculate the spent levels as

$$
\alpha_t(\gamma) = a(\gamma, y_t) - a(\gamma, y_{t-1})
= P\Bigl( \{ p_t \le \alpha_t^*(\gamma) \} \cap \bigcap_{s=1}^{t-1} \{ p_s > \alpha_s^*(\gamma) \} \Bigr),
$$

where the nominal levels $\alpha_t^*(\gamma)$ serve as the interim decision boundaries. As indicated in Maurer and Bretz (2013b), for many spending functions (including O'Brien-Fleming- and Pocock-type boundaries), it holds for γ′ > γ that for all $t = 1, \ldots, h$

$$
\alpha_t(\gamma') \ge \alpha_t(\gamma) \;\Rightarrow\; \alpha_t^*(\gamma') \ge \alpha_t^*(\gamma). \tag{14.6}
$$

This property is used in the following to ensure the validity of the graphical testing procedure.
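The spent levels $\alpha_t(\gamma)$ are straightforward to compute once a spending function is fixed. The Python sketch below uses the Pocock-type spending function $a(\gamma, y) = \gamma \ln(1 + (e - 1)y)$ as an illustrative choice; the text does not prescribe a particular function, and the nominal levels $\alpha_t^*(\gamma)$, which require multivariate normal probabilities, are not computed here.

```python
import math

def pocock_type_spending(gamma, y):
    """Pocock-type spending function: a(gamma, 0) = 0 and a(gamma, 1) = gamma."""
    return gamma * math.log(1.0 + (math.e - 1.0) * y)

def spent_levels(gamma, fractions):
    """Spent levels alpha_t(gamma) = a(gamma, y_t) - a(gamma, y_{t-1}), with y_0 = 0."""
    levels, previous = [], 0.0
    for y in fractions:
        current = pocock_type_spending(gamma, y)
        levels.append(current - previous)
        previous = current
    return levels

# One interim analysis at half of the maximum information, then the final analysis.
fractions = [0.5, 1.0]
levels = spent_levels(0.0125, fractions)         # e.g. gamma = alpha * w_i with w_i = 1/2
levels_doubled = spent_levels(0.025, fractions)  # level after the weight has increased to 1

print([round(x, 5) for x in levels])
print([round(x, 5) for x in levels_doubled])
```

For this spending function the spent levels increase with γ at every analysis, which is the premise of (14.6); when the weight of a hypothesis grows after a rejection elsewhere in the graph, its boundaries are recomputed from the larger level.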

We now consider testing m one-sided null hypotheses $H_i$, $i \in I = \{1, \ldots, m\}$, in a group sequential trial at time points $t = 1, \ldots, h$. For each $H_i$, define its spending function $a_i(\gamma, y)$ with spent levels $\alpha_{i,t}(\gamma) = a_i(\gamma, y_t) - a_i(\gamma, y_{t-1})$ and nominal levels $\alpha_{i,t}^*(\gamma)$. As in Section 14.4.1, we separate the weighting strategy from the actual test procedure being employed. Within the framework of closed testing (Section 14.3.5), we reject an intersection hypothesis $H_J$, $J \subseteq I$, at time point t if $p_{i,t} \le \alpha_{i,t}^*(\alpha w_i(J))$ for at least one $i \in J$. It then can be shown that the monotonicity conditions on the weights (14.2) and on the spending functions (14.6) ensure sequentially rejective closed group sequential test procedures (Maurer and Bretz, 2013b). In particular, graphical test procedures can be derived by a slight modification of Algorithm 14.1 (Algorithm 14.7).

Algorithm 14.7 (Weighted Bonferroni tests, h − 1 interim analyses)
0. Set t = 1 and $I = \{1, 2, \ldots, m\}$.
1. At analysis t, compute unadjusted p-values $p_{i,t}$ and nominal significance levels $\alpha_{i,t} = \alpha_{i,t}^*(\alpha w_i(I))$ for $i \in I$.
2. Select a $j \in I$ such that $p_{j,t} \le \alpha_{j,t}$, reject $H_j$ and go to Step 3. If no such j exists and t < h, the trial can be continued with t → t + 1; go to Step 1 in this case, otherwise stop.
3. Update the graph:

$$
I \to I \setminus \{j\},
$$

$$
w_\ell(I) \to
\begin{cases}
w_\ell(I) + w_j(I)\, g_{j\ell}, & \text{for } \ell \in I,\\
0, & \text{otherwise},
\end{cases}
$$

$$
g_{\ell k} \to
\begin{cases}
\dfrac{g_{\ell k} + g_{\ell j}\, g_{jk}}{1 - g_{\ell j}\, g_{j\ell}}, & \text{for } \ell, k \in I,\ \ell \neq k,\ g_{\ell j}\, g_{j\ell} < 1,\\
0, & \text{otherwise}.
\end{cases}
$$

4. If $|I| \ge 1$, go to Step 1; otherwise stop.

A numerical example illustrating Algorithm 14.7 is given in Maurer and Bretz (2013b). The approach mentioned earlier can be generalized to allow more flexibility in the choice of the group sequential boundaries (Xi and Tamhane, 2014a). Extensions of the graphical approach to adaptive group sequential trials with treatment selection and other adaptations at interim are also available; see Sugitani et al. (2013, 2014) and Klinglmueller et al. (2014) for details.

14.4.4 Graphical Approaches for Families of Hypotheses

Sometimes the structure for a family of hypotheses is best described by introducing subfamilies of distinct hypotheses. An example of such a situation is given by the ATMOSPHERE case study from Section 14.2.3 and visualized in Figure 14.7. In this example, the hypotheses associated with the secondary objectives are subsumed in two families $\mathcal{H}_4$ and $\mathcal{H}_5$. If, in addition, one would consider the remaining individual null hypotheses H1, H2, and H3 also as distinct subfamilies $\mathcal{H}_i = \{H_i\}$, i = 1, 2, 3, then every single node in Figure 14.7 would represent a subfamily. An extended multiple test procedure would then propagate local significance levels between these subfamilies, instead of operating on the individual null hypotheses. More specifically, a local significance level is propagated only if all individual null hypotheses within a subfamily are rejected, followed by an update of the graph according to Algorithm 14.1 for single null hypotheses.

Such an approach provides an extension of serial gatekeeper procedures to situations with nonhierarchical structures between the subfamilies; see Bauer et al. (1998) for the special case of using a Holm procedure across the subfamilies of hypotheses and any multiple test procedure within each subfamily. In Section 14.3.4, we introduced edges with weight ε to propagate local significance levels, conditional on rejecting all individual hypotheses in a subfamily. If the multiple test procedures within the subfamilies can also be represented as graphs, an overall multiple test procedure on a staged structure of subfamilies of hypotheses can therefore be visualized and performed as a graphical test procedure as well. In the following, we provide a more general algorithm that is also valid if the test procedures within a subfamily are not necessarily graphical.

Let $\mathcal{H}_i$, $i \in I = \{1, \ldots, m\}$, denote m families of $k_i \ge 1$ individual null hypotheses $H_{ij}$, $j = 1, \ldots, k_i$. Further, let $I_i = \{(i, 1), \ldots, (i, k_i)\}$ denote the set of index pairs (i, j) of hypotheses $H_{ij} \in \mathcal{H}_i$, $i \in I$. Finally, let $\varphi_i$ denote an α-consistent multiple test procedure defined on $\mathcal{H}_i$, which ensures that any hypothesis in $\mathcal{H}_i$ rejected by $\varphi_i$ at level α is also rejected at any level α′ > α; see, for example, Hommel and Bretz (2008). This property allows the computation of locally adjusted p-values $p_{ij}^*$. That is, $p_{ij}^*$ is the smallest significance level at which $H_{ij}$ can be rejected with $\varphi_i$. For example, if $\varphi_i$ denotes the Hochberg procedure for $\mathcal{H}_i$, $p_{ij}$ the unadjusted p-value for $H_{ij}$, and $p_{i(k)}$ the kth ordered p-value within $\mathcal{H}_i$, then

$$
p_{i(k)}^* = \min\bigl\{ 1,\ \min\bigl[ (k_i - k + 1)\, p_{i(k)},\ p_{i(k+1)}^* \bigr] \bigr\}.
$$

If locally adjusted p-values $p_{ij}^*$ are available for each hypothesis $H_{ij}$, then we can define a sequentially rejective test procedure through Algorithm 14.8; see also Maurer and Bretz (2014).
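The recursion for the Hochberg locally adjusted p-values can be sketched in Python as follows. The function name is hypothetical, and the recursion is started with the convention $p_{i(k_i+1)}^* = 1$, which is an assumption of this sketch (the formula above leaves the starting value implicit).

```python
def hochberg_adjusted(p):
    """Locally adjusted p-values of the Hochberg procedure within one family.

    Works from the largest ordered p-value (k = k_i) down to the smallest (k = 1).
    """
    k_i = len(p)
    order = sorted(range(k_i), key=lambda i: p[i])   # indices in ascending order of p
    adjusted = [0.0] * k_i
    previous = 1.0                                   # assumed start value p*_(k_i + 1) = 1
    for k in range(k_i, 0, -1):
        idx = order[k - 1]
        previous = min(1.0, (k_i - k + 1) * p[idx], previous)
        adjusted[idx] = previous
    return adjusted

family = [0.01, 0.04, 0.03]
print([round(x, 4) for x in hochberg_adjusted(family)])
```

For this family the ordered p-values are 0.01, 0.03, 0.04, so the recursion yields 0.04 for the largest, min(2 · 0.03, 0.04) = 0.04 for the middle one, and min(3 · 0.01, 0.04) = 0.03 for the smallest.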

A further generalization of the procedure mentioned earlier is described in Kordzakhia and Dmitrienko (2013). They consider separable multiple test procedures $\varphi_i$ that propagate a certain amount of the level $\alpha_i$ to other subfamilies already after rejecting an individual null hypothesis in $\mathcal{H}_i$. The main difference to Algorithm 14.8 is that after each rejection of an individual null hypothesis, a certain error fraction is propagated to the other subfamilies according to the transition weights of the graph. The transition weights between the families, however, are updated only if all hypotheses in a subfamily are rejected, as in Algorithm 14.8.


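Algorithm 14.8 (Weighted Bonferroni tests on families of hypotheses)
0. Set $I = \{1, 2, \ldots, m\}$ and $I_i = \{(i, j) : j = 1, \ldots, k_i\}$, $i \in I$.
1. Select an $(i, j) \in I_i$, $i \in I$, such that $p_{ij}^* \le \alpha_i$, reject $H_{ij}$ and set $I_i \to I_i \setminus \{(i, j)\}$; otherwise stop.
2. If $|I_i| \ge 1$, go to Step 1; otherwise update the graph:

$$
I \to I \setminus \{i\},
$$

$$
\alpha_\ell \to
\begin{cases}
\alpha_\ell + \alpha_i\, g_{i\ell}, & \text{for } \ell \in I,\\
0, & \text{otherwise},
\end{cases}
$$

$$
g_{\ell k} \to
\begin{cases}
\dfrac{g_{\ell k} + g_{\ell i}\, g_{ik}}{1 - g_{\ell i}\, g_{i\ell}}, & \text{for } \ell, k \in I,\ \ell \neq k,\ g_{\ell i}\, g_{i\ell} < 1,\\
0, & \text{otherwise}.
\end{cases}
$$

3. If $|I| \ge 1$, go to Step 1; otherwise stop.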

14.4.5 Entangled Graphs

The graphical procedures considered so far have no memory in the sense that the origin of the propagated significance level is ignored in subsequent iterations. However, there are clinical trial applications where such a memory is needed to reflect the underlying dependence structure of the study objectives. In such cases, it would be desirable that the further propagation of significance levels depends on their origin and thus reflects the grouped parent–descendant structures of the hypotheses.

In the following, we extend the case study from Section 14.2.1 to motivate the need for test procedures with memory. In that case study, assume that both diabetes endpoints HbA1c and body weight are measured in an initial trial period (period 1) and that the trial is continued to a second period (period 2) to investigate potential CV complications by comparing the pooled data from both doses against placebo. Let H5 denote the additional null hypothesis. With the notation from Section 14.2.1, we have the following requirements:

i. The primary hypotheses H1 and H2 of period 1 are considered to be more important than the period 2 hypothesis H5, which in turn is considered to be more important than the secondary hypotheses H3 and H4.

ii. Both doses are considered equally important.

iii. A secondary hypothesis should only be rejected if the associated primary hypothesis for the same dose had been rejected.


FIGURE 14.14 Graphical test procedure without memory. (a) Initial graph. (b) Updated graph after rejecting H1. (c) Updated graph after rejecting H1 and H5.

We now illustrate that conditions (i) and (iii) cannot be satisfied simultaneously with the graphical test procedures considered so far. Consider the initial graph in Figure 14.14, where for simplicity weights 0 at the nodes are not displayed. Assume that for a given significance level α, we have p1 < α/2 and p2 > α. That is, H1 can be rejected and its level α/2 is propagated to H5, leading to the middle graph in Figure 14.14. If furthermore p5 < α/2, then H5 is rejected as well. Its level α/2 is halved and propagated to both H3 and H4, each of which can now be tested at level α/4. This, however, violates condition (iii), which requires that H4 should not be tested as long as H2 is not rejected.
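The propagation in Figure 14.14 can be reproduced with a few lines of Python. This is a minimal sketch, assuming the standard single-graph update rule (the same form as the update step of Algorithm 14.9 with a single graph) and edge weights read off the figure: H1 → H5 and H2 → H5 with weight 1, H5 → H3 and H5 → H4 with weight 1/2. The function name `reject` and the dictionary layout are illustrative choices, not part of the chapter or of any package. Levels are stored as fractions of α.

```python
from fractions import Fraction as F


def reject(levels, G, j):
    """Remove hypothesis j and propagate its level along its outgoing edges."""
    def g(a, b):
        # Edge weight a -> b; missing edges have weight 0.
        return F(G.get(a, {}).get(b, 0))

    rest = [h for h in levels if h != j]
    # Each remaining node l receives the share g(j, l) of the level of j.
    new_levels = {l: levels[l] + levels[j] * g(j, l) for l in rest}
    new_G = {}
    for l in rest:
        loop = g(l, j) * g(j, l)  # weight of the round trip l -> j -> l
        new_G[l] = {}
        for k in rest:
            if k != l and loop < 1:
                w = (g(l, k) + g(l, j) * g(j, k)) / (1 - loop)
                if w > 0:
                    new_G[l][k] = w
    return new_levels, new_G


# Initial graph of Figure 14.14(a); levels are fractions of alpha.
levels = {"H1": F(1, 2), "H2": F(1, 2), "H3": F(0), "H4": F(0), "H5": F(0)}
G = {"H1": {"H5": 1}, "H2": {"H5": 1}, "H5": {"H3": F(1, 2), "H4": F(1, 2)}}

levels_b, G_b = reject(levels, G, "H1")      # panel (b): H1 rejected
levels_c, G_c = reject(levels_b, G_b, "H5")  # panel (c): H1 and H5 rejected
print(levels_c)
```

In the final state, H3 and H4 each carry α/4 although H2 has not been rejected, which is exactly the conflict with condition (iii).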

Requirement (iii) can be achieved by defining an individual graph for each parent–descendant relationship and combining them afterward. We can split the initial graph from Figure 14.14 into two separate graphs (G1, G2), each being defined on the hypotheses H1 through H5; see Figure 14.15a. The basic idea is to test each hypothesis at the sum of its significance levels from the individual graphs G1 and G2. For example, in Figure 14.15a, we test H1 at level α/2 + 0 and H2 at level 0 + α/2. The first step is therefore the same as for the initial graph in Figure 14.14. However, if we assume that H1 can be rejected, we now update each individual graph using Algorithm 14.1. For ease of illustration, Figure 14.15b displays the resulting entangled graph by overlaying the two individually updated graphs G1 and G2. The local significance levels are displayed as vectors (α1i, α2i) for each hypothesis Hi, unless α1i = α2i = 0 (in which case they are omitted for better readability). In the updated graph from Figure 14.15b, H2 and H5 can each be tested at level α/2. If now again H5 is rejected, its level α/2 is propagated according to the rules for each individual graph, resulting in Figure 14.15c. Note that now H3 is tested at level α/2 originating from H1, as opposed to the level α/4 in Figure 14.14, whereas H4 remains at level 0 until H2 is rejected. The example in Figure 14.15 can be improved by further propagating the levels of H3 and H4; see Figure 2 in Maurer and Bretz (2013a) for an example.

FIGURE 14.15 Entangled graphs (G1, G2) with memory. (a) Initial individual graphs. (b) Updated entangled graph after rejecting H1. (c) Updated entangled graph after rejecting H1 and H5.
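The level vectors (α1i, α2i) across the three panels of Figure 14.15 can be traced by hand. The table below is a worked sketch that assumes the component graphs are G1: H1 → H5 → H3 and G2: H2 → H5 → H4, with all edge weights equal to 1; this edge structure is inferred from the description of the example and should be read as an assumption. A dash marks a hypothesis that has already been rejected.

```latex
\begin{array}{l|ccccc}
 & H_1 & H_2 & H_3 & H_4 & H_5 \\ \hline
\text{(a) initial}
  & (\alpha/2,\, 0) & (0,\, \alpha/2) & (0,\, 0) & (0,\, 0) & (0,\, 0) \\
\text{(b) } H_1 \text{ rejected}
  & - & (0,\, \alpha/2) & (0,\, 0) & (0,\, 0) & (\alpha/2,\, 0) \\
\text{(c) } H_1, H_5 \text{ rejected}
  & - & (0,\, \alpha/2) & (\alpha/2,\, 0) & (0,\, 0) & -
\end{array}
```

Summing the two components of each vector gives the level at which the hypothesis is actually tested. In row (c), H3 is tested at α/2 (all of it inherited through G1) and H4 at 0, so H4 cannot be rejected before H2.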

More formally, entangled graphs and the associated sequentially rejective test procedure can be described as follows. Assume that the structure for $m$ null hypotheses $H_1, \ldots, H_m$ is given by $n$ different parent–descendant relationships. For each of these $n$ relationships, we have tailored structural dependencies between the hypotheses, leading to $n$ individual graphs $\mathcal{G}_1, \ldots, \mathcal{G}_n$. Let $\mathcal{G}_h = (\boldsymbol{\alpha}_h, \mathbf{G}_h)$, $h = 1, \ldots, n$, denote the individual graphs with local significance levels $\boldsymbol{\alpha}_h = (\alpha_{h1}, \ldots, \alpha_{hm})$, such that $\sum_{h=1}^{n} \sum_{i=1}^{m} \alpha_{hi} \le \alpha$, and $m \times m$ transition matrices $\mathbf{G}_h$. The entries $g_{hij}$ of $\mathbf{G}_h$ are freely chosen subject to the regularity conditions $0 \le g_{hij} \le 1$, $g_{hii} = 0$, and $\sum_{\ell=1}^{m} g_{hi\ell} \le 1$ for all $i, j = 1, \ldots, m$, $h = 1, \ldots, n$.

Each hypothesis $H_i$, $i = 1, \ldots, m$, is tested at local level $\alpha_i = \sum_{h=1}^{n} \alpha_{hi}$. If for any $j = 1, \ldots, m$ the null hypothesis $H_j$ can be rejected (i.e., $p_j \le \alpha_j$), then each graph $\mathcal{G}_h$ is updated separately. That is, the node $j$ (i.e., hypothesis $H_j$) is removed and the local levels $\boldsymbol{\alpha}_h$ as well as the transition matrices $\mathbf{G}_h$ for the remaining $m - 1$ hypotheses are updated using a modified version of Algorithm 14.1; see Algorithm 14.9. Once this is achieved, the $m - 1$ remaining hypotheses are tested at the updated significance levels and the previous steps are repeated until no further hypotheses can be rejected.

Algorithm 14.9 (Entangled Bonferroni-based graphs)

0. Set $I = \{1, \ldots, m\}$.

1. Let $\alpha_i = \sum_{h=1}^{n} \alpha_{hi}$, $i \in I$. Select a $j \in I$ such that $p_j \le \alpha_j$ and reject $H_j$; otherwise stop.

2. Update the graph:

$$I \to I \setminus \{j\},$$

$$\alpha_{h\ell} \to \begin{cases} \alpha_{h\ell} + \alpha_{hj} g_{hj\ell}, & \text{for } \ell \in I,\ h = 1, \ldots, n, \\ 0, & \text{otherwise,} \end{cases}$$

$$g_{h\ell k} \to \begin{cases} \dfrac{g_{h\ell k} + g_{h\ell j} g_{hjk}}{1 - g_{h\ell j} g_{hj\ell}}, & \text{for } \ell, k \in I,\ \ell \ne k,\ g_{h\ell j} g_{hj\ell} < 1,\ h = 1, \ldots, n, \\ 0, & \text{otherwise.} \end{cases}$$

3. If $|I| \ge 1$, go to Step 1; otherwise stop.
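The steps of Algorithm 14.9 translate fairly directly into code. The following Python sketch is one possible reading, not the authors' implementation and not the gMCP package: `update` applies the update step to one component graph, and `entangled_test` sums the levels over all component graphs before each test. The function names, the dictionary-based graph layout, the p-values, and the choice α = 0.025 are all assumptions made for illustration. The example graphs are G1: H1 → H5 → H3 and G2: H2 → H5 → H4, as inferred from the description of Figure 14.15.

```python
from fractions import Fraction as F


def update(levels, G, j):
    """Update one component graph after rejecting hypothesis j (Step 2)."""
    def g(a, b):
        return F(G.get(a, {}).get(b, 0))

    rest = [h for h in levels if h != j]
    new_levels = {l: levels[l] + levels[j] * g(j, l) for l in rest}
    new_G = {l: {} for l in rest}
    for l in rest:
        loop = g(l, j) * g(j, l)
        if loop < 1:  # otherwise all outgoing weights of l are set to 0
            for k in rest:
                if k != l:
                    new_G[l][k] = (g(l, k) + g(l, j) * g(j, k)) / (1 - loop)
    return new_levels, new_G


def entangled_test(graphs, p, alpha):
    """graphs: list of (levels, G) pairs; levels are fractions of alpha."""
    active = sorted(p)
    rejected = []
    while active:
        # Step 1: the local level of H_i is the sum over all component graphs.
        total = {i: sum(lv[i] for lv, _ in graphs) for i in active}
        hits = [i for i in active if p[i] <= total[i] * alpha]
        if not hits:
            break
        j = hits[0]
        # Step 2: every component graph is updated separately.
        graphs = [update(lv, G, j) for lv, G in graphs]
        active.remove(j)
        rejected.append(j)
    return rejected, graphs


zero = {h: F(0) for h in ("H1", "H2", "H3", "H4", "H5")}
G1 = ({**zero, "H1": F(1, 2)}, {"H1": {"H5": 1}, "H5": {"H3": 1}})
G2 = ({**zero, "H2": F(1, 2)}, {"H2": {"H5": 1}, "H5": {"H4": 1}})
p = {"H1": 0.010, "H2": 0.200, "H3": 0.005, "H4": 0.001, "H5": 0.010}

rejected, final = entangled_test([G1, G2], p, alpha=0.025)
print(rejected)  # -> ['H1', 'H5', 'H3']
```

With these hypothetical p-values, H4 is not rejected despite its small p-value, because its level stays at 0 as long as H2 is retained. Keeping the component graphs separate and summing only at test time is what gives the procedure its memory.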

A formal proof for the validity of this sequentially rejective graphical test procedure is given by Maurer and Bretz (2013a). They also investigated further properties and alternative representations of entangled graphs. In particular, they showed the equivalence between the class of entangled graphs proposed in this section and the default graphs proposed by Burman et al. (2009). The entangled graphs can also be used to visualize gatekeeping procedures using the truncated Holm procedure (Dmitrienko et al., 2008; Strassburger and Bretz, 2008; Maurer and Bretz, 2013a). In the next section, we show how to use entangled graphs to address the problem of rejecting at least k out of m hypotheses in the context of gatekeeping.

14.4.6 Graphical Approaches for k-out-of-m Gatekeeper Problems

Multiple test procedures defined by entangled graphs can have properties that a single graph cannot provide. In Section 14.4.5, we have seen how entanglement can create memory. Another property not shared by single graphs in general is the requirement that at least k out of m hypotheses of a primary family of hypotheses should be rejected before secondary hypotheses become testable, for 1 ≤ k ≤ m. Such a requirement is implicitly given, for example, in the FDA Guidance for Industry on rheumatoid arthritis (FDA, 1999). It states that “. . . trial results were considered to support a conclusion of effectiveness when statistical evidence of efficacy was shown for at least three of the four measures . . .” According to this guideline, the primary objective of a rheumatoid arthritis trial is to demonstrate a beneficial effect for at least k = 3 out of m = 4 primary endpoints.

To discuss the problem formally, let F1 denote a family of m primary hypotheses and F2 a family of n secondary hypotheses. We require that F2 is tested only if at least k of the m primary hypotheses in F1 have been rejected. For the special case k = 1, we can construct multiple test procedures visualized by single graphs as long as there are edges with positive weights connecting each primary hypothesis with at least one secondary hypothesis. For the special case k = m, all primary hypotheses have to be rejected before a secondary hypothesis is tested (Maurer et al., 1995). Any FWER controlling multiple test procedure can be used to test F1 before testing the secondary hypotheses. Such procedures can be constructed with a single graph by using the ε-edges introduced in Section 14.3.4. A simple example is the graph displayed in Figure 14.6, where H3 is only tested after having rejected both H1 and H2.

For 2 ≤ k ≤ m − 1, it does not seem to be possible to construct a multiple test procedure with a single graph that has the desired k-out-of-m gatekeeping property. It is possible, however, to construct a graph where a particular subset of k primary hypotheses has to be rejected before a secondary hypothesis can be rejected. In order to allow testing a secondary hypothesis if any subset of primary hypotheses of size k is rejected, we define serial gatekeeping graphs for all $\binom{m}{k}$ subsets of k primary hypotheses and entangle them (Maurer and Bretz, 2013a).

For the sake of concreteness, we revisit the rheumatoid arthritis example discussed earlier and consider the primary hypotheses family F1 = {H1, H2, H3, H4} with m = 4. Let k = 3 and choose a Holm procedure with three hypotheses for each of the four subsets $I_\ell$, $\ell = 1, \ldots, \binom{4}{3} = 4$. Each of the four subfamilies $F_{1\ell} = \{H_j;\ j \in I_\ell\}$ is assigned an initial significance level α/4. That level is propagated to F2 if all three hypotheses are rejected in one of the subfamilies $F_{1\ell}$. Figure 14.16 visualizes one of the four component graphs with a single secondary hypothesis, that is, F2 = {H5}. Note that the subgraph on {H1, H2, H3} in Figure 14.16 visualizes the Holm procedure at level α/4. The other three component graphs are obtained by permuting the indices of the primary hypotheses in F1. The entangled graph consisting of the four component graphs then has the desired property that at least 3 out of the 4 primary hypotheses have to be rejected before the secondary hypothesis H5 can be tested. More precisely, the resulting procedure is equivalent to the following test procedure: The ordinary Holm procedure is performed on F1 at level α until any three of the primary hypotheses are rejected. Then the remaining primary hypothesis can be tested at level 3α/4. If it cannot be rejected, the secondary hypotheses in F2 can be tested at level α/4 with any valid multiple test procedure, and otherwise at level α.
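The equivalent test procedure described in words above can be written as a short decision routine. The sketch below covers only the 3-out-of-4 case with a single secondary hypothesis H5; the function name `three_of_four_gatekeeper`, the 0-based indexing of the primary hypotheses, and the default α = 0.025 are illustrative assumptions. It follows the verbal description rather than constructing the four entangled component graphs.

```python
def three_of_four_gatekeeper(p_primary, p_secondary, alpha=0.025):
    """Return (rejected primary indices, whether H5 is rejected)."""
    # Order the four primary hypotheses by increasing p-value.
    order = sorted(range(4), key=lambda i: p_primary[i])
    rejected = []
    # Ordinary Holm steps at alpha/4, alpha/3, alpha/2 for the first three.
    for step, i in enumerate(order[:3]):
        if p_primary[i] <= alpha / (4 - step):
            rejected.append(i)
        else:
            # Fewer than three primary rejections: H5 is not testable.
            return rejected, False
    # Three primary hypotheses rejected: the last one is tested at 3*alpha/4.
    last = order[3]
    if p_primary[last] <= 0.75 * alpha:
        rejected.append(last)
        level_secondary = alpha          # all four rejected
    else:
        level_secondary = 0.25 * alpha   # exactly three rejected
    return rejected, p_secondary <= level_secondary


print(three_of_four_gatekeeper([0.001, 0.004, 0.010, 0.020], 0.005))
print(three_of_four_gatekeeper([0.001, 0.004, 0.020, 0.020], 0.005))
print(three_of_four_gatekeeper([0.001, 0.004, 0.010, 0.015], 0.020))
```

The only departure from the ordinary Holm procedure is the last primary step, which uses 3α/4 instead of α; the remaining α/4 is what allows H5 to be tested after exactly three primary rejections.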

For large values of m, the number of component graphs becomes difficult to handle. However, if the component graphs are the same up to permutation, it is sufficient to display only one of them, as done in Figure 14.16. In addition, if the component graphs are of a simple structure, the construction method allows one to derive an equivalent sequentially rejective test procedure with the desired properties that can be described in a few sentences or a simple decision tree for any values of k and m.

FIGURE 14.16 Graphical visualization of a k-out-of-m gatekeeper procedure with m = 4 primary hypotheses, k = 3, and one secondary hypothesis H5.

Another class of k-out-of-m gatekeeper procedures has been proposed by Xi and Tamhane (2014b). These procedures allow one to use Holm, Hochberg, or Hommel tests for the primary hypotheses before proceeding to the secondary hypotheses. The basic idea for their Bonferroni-based k-out-of-m gatekeeper procedure can be described as follows. Let $I_1$ denote the index set of the m primary hypotheses, $I_2$ that of the n secondary hypotheses, and $I = I_1 \cup I_2$. Consider an index set $J = J_1 \cup J_2 \subseteq I$, where $J_1 \subseteq I_1$ and $J_2 \subseteq I_2$. Let $\alpha^*_j(J_1)$ denote the local significance levels of (any) consonant weighted Bonferroni test on $I_1$. The local levels $\alpha_j(J)$ of the hypotheses $H_j$ for a weighted Bonferroni test of the intersection hypothesis $H_J$ have the following property: As long as the cardinality $|J_1|$ of $J_1$ is greater than or equal to k, one sets $\alpha_j(J) = \alpha^*_j(J_1)$ for $j \in J_1$ and $\alpha_j(J) = 0$ for $j \in J_2$. For $|J_1| < k$, one chooses $\alpha_j(J) \le \alpha^*_j(J_1)$ for $j \in J_1$, with strict inequality for at least one j, and $\alpha_j(J) \ge 0$ for $j \in J_2$, with strict inequality for at least one j. This is possible since the local significance levels satisfy the monotonicity condition (14.2). The resulting closed test procedure then has the k-out-of-m property. The entangled graphs described earlier can represent some but not all of the procedures proposed by Xi and Tamhane (2014b) but remain more flexible to address partially ordered primary and secondary hypotheses.


14.5 Conclusions

In this chapter, we provided an extensive overview of graphical approaches to multiple testing problems that are frequently encountered in clinical studies. The proposed graphical approaches offer the possibility to tailor advanced multiple test procedures to structured families of hypotheses and visualize complex decision strategies in an efficient and easily communicable way while controlling the FWER strongly at a designated significance level α. Many common multiple test procedures can be displayed using the graphical approach, including fixed sequence, fallback, and gatekeeping procedures. The main advantage, however, is the degree of flexibility offered by this approach to meet the given clinical study objectives, as demonstrated by the various case studies in this chapter. The graphical approach covers a broad range of applications and extensions, such as the calculation of adjusted p-values and simultaneous confidence intervals, the use of weighted Bonferroni, Simes, and parametric tests, the application to group sequential trials, and the creation of entangled graphs for advanced clinical trial applications.

The proposed graphical approach is tailored to confirmatory trials that could potentially serve later as basis for regulatory decision making. The need to control strongly the FWER in this context is clear and mandated by regulatory guidelines (ICH, 1998; CHMP, 2002). Beyond this, multiplicity has a much broader impact that raises challenging problems, which affect almost every decision throughout drug development. Good decision making and reproducibility need to account for multiplicity and might need different solutions at different drug development stages. The details of how to address multiplicity are often not clear-cut and depend on the situation. Thus, on a broader scale, there is a need for statisticians to engage strategically in the related clinical team discussions. There should be a transparent discussion with the medical/commercial colleagues with respect to which endpoints are critically important to approval and label, and which are not as important and should be considered more exploratory. Clinical trials with a rather large number of hypotheses and simple hierarchically ordered test procedures should be avoided. In the end, any methodology is only as good as the business decisions that the teams are making with respect to their identification of important/critical endpoints.

References

Alosh, M., Bretz, F., and Huque, M. (2014). Advanced multiplicity adjustment methods in clinical trials. Statistics in Medicine, 33:693–713.

Alosh, M. and Huque, M. (2010). A consistency-adjusted alpha-adaptive strategy for sequential testing. Statistics in Medicine, 29:1559–1571.


Bauer, P., Röhmel, J., Maurer, W., and Hothorn, L. (1998). Testing strategies in multi-dose experiments including active control. Statistics in Medicine, 17:2133–2146.

Benjamini, Y. and Heller, R. (2007). False discovery rates for spatial signals. Journal of the American Statistical Association, 102:1272–1281.

Benjamini, Y. and Hochberg, Y. (1997). Multiple hypothesis testing with weights. Scandinavian Journal of Statistics, 24:407–418.

Brannath, W. and Bretz, F. (2010). Shortcuts for locally consonant closed test procedures. Journal of the American Statistical Association, 105:660–669.

Bretz, F., Hothorn, T., and Westfall, P. (2010). Multiple Comparisons Using R. Taylor & Francis, Boca Raton, FL.

Bretz, F., Maurer, W., Brannath, W., and Posch, M. (2009). A graphical approach to sequentially rejective multiple test procedures. Statistics in Medicine, 28:586–604.

Bretz, F., Maurer, W., and Hommel, G. (2011a). Test and power considerations for multiple endpoint analyses using sequentially rejective graphical procedures. Statistics in Medicine, 30:1489–1501.

Bretz, F., Posch, M., Glimm, E., Klinglmueller, F., Maurer, W., and Rohmeyer, K. (2011b). Graphical approaches for multiple comparison procedures using weighted Bonferroni, Simes or parametric tests. Biometrical Journal, 53:894–913.

Burman, C., Sonesson, C., and Guilbaud, O. (2009). A recycling framework for the construction of Bonferroni-based multiple tests. Statistics in Medicine, 28:739–761.

Chen, J., Luo, J., Liu, K., and Mehrotra, D. (2011). On power and sample size computation for multiple testing procedures. Computational Statistics & Data Analysis, 55:110–122.

CHMP (2002). Committee for Medicinal Products for Human Use (CHMP). Points to consider on “Multiplicity issues in clinical trials”. www.ema.europa.eu, accessed July 11, 2014.

Cowling, B., Hutton, J., and Shaw, J. (2006). Joint modeling of event counts and survival times. Applied Statistics, 55:31–39.

Dmitrienko, A., Kordzakhia, G., and Tamhane, A. (2011). Multistage and mixture parallel gatekeeping procedures in clinical trials. Journal of Biopharmaceutical Statistics, 21:726–747.

Dmitrienko, A., Offen, W., and Westfall, P. (2003). Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Statistics in Medicine, 22:2387–2400.

Dmitrienko, A. and Tamhane, A. (2011). Mixtures of multiple testing procedures for gatekeeping applications in clinical trial applications. Statistics in Medicine, 30:1473–1488.

Dmitrienko, A., Tamhane, A., and Bretz, F. (2009). Multiple Testing Problems in Pharmaceutical Statistics. Taylor & Francis, Boca Raton, FL.

Dmitrienko, A., Tamhane, A., and Wiens, B. (2008). General multi-stage gatekeeping procedures. Biometrical Journal, 50:667–677.

Emerson, S. (2007). Frequentist evaluation of group sequential clinical trial designs. Statistics in Medicine, 26:5047–5080.

FDA (1999). U.S. Food and Drug Administration (FDA). Guidance for industry—Clinical development programs for drugs, devices, and biological products for the treatment of rheumatoid arthritis (RA). www.fda.gov, accessed July 11, 2014.

Gabriel, K. (1969). Simultaneous test procedures—Some theory of multiple comparisons. Annals of Mathematical Statistics, 40:224–250.


Genz, A. and Bretz, F. (2009). Computation of Multivariate Normal and t Probabilities. Springer, Heidelberg, Germany.

Guilbaud, O. (2007). Bonferroni parallel gatekeeping—Transparent generalization, adjusted p-values and short direct proofs. Biometrical Journal, 49:917–927.

Guilbaud, O. (2008). Simultaneous confidence regions corresponding to Holm’s stepdown procedure and other closed testing procedures. Biometrical Journal, 50:678–692.

Guilbaud, O. (2011). Note on simultaneous inferences about non-inferiority and superiority for a primary and a secondary endpoint. Biometrical Journal, 53:927–937.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70.

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75:383–386.

Hommel, G. and Bretz, F. (2008). Aesthetics and power considerations in multiple testing—A contradiction? Biometrical Journal, 50:657–666.

Hommel, G., Bretz, F., and Maurer, W. (2007). Powerful short-cuts for multiple testing procedures with special reference to gatekeeping strategies. Statistics in Medicine, 26:4063–4073.

Hung, H. and Wang, S. (2009). Some controversial multiple testing problems in regulatory applications. Journal of Biopharmaceutical Statistics, 19:1–11.

Hung, H. and Wang, S. (2010). Challenges to multiple testing in clinical trials. Biometrical Journal, 52:747–756.

Huque, M., Alosh, M., and Bhore, R. (2011). Addressing multiplicity issues of a composite endpoint and its components in clinical trials. Journal of Biopharmaceutical Statistics, 21:610–634.

ICH (1998). International Conference on Harmonization. Topic E9: Statistical Principles for Clinical Trials. www.ich.org, accessed July 11, 2014.

Jennison, C. and Turnbull, B. (2000). Group Sequential Methods with Applications to Clinical Trials. Chapman and Hall/CRC, Boca Raton, FL.

Julious, S. and McIntyre, N. (2012). Sample sizes for trials involving multiple correlated must-win comparisons. Pharmaceutical Statistics, 11:177–185.

Kim, H., Entsuah, R., and Shults, J. (2011). The union closure method for testing a fixed sequence of families of hypotheses. Biometrika, 98:391–401.

Klinglmueller, F., Posch, M., and Koenig, F. (2014). Adaptive graph-based multiple testing procedures. (submitted).

Kordzakhia, G. and Dmitrienko, A. (2013). Superchain procedures in clinical trials with multiple objectives. Statistics in Medicine, 32:486–508.

Krum, H., Massie, B., Abraham, W., et al. (2011). Direct renin inhibition in addition to or as an alternative to angiotensin converting enzyme inhibition in patients with chronic systolic heart failure: rationale and design of the aliskiren trial to minimize outcomes in patients with heart failure (ATMOSPHERE) study. European Journal of Heart Failure, 13:107–114.

Lawrence, J. (2011). Testing non-inferiority and superiority for two endpoints for several treatments with a control. Pharmaceutical Statistics, 10:318–324.

Li, J. and Mehrotra, D. (2008). An efficient method for accommodating potentially underpowered primary endpoints. Statistics in Medicine, 27:5377–5391.

Liu, L., Wolfe, R., and Huang, X. (2004). Shared frailty models for recurrent events and terminal event. Biometrics, 60:747–756.


Luo, X., Chen, G., Ouyang, S., and Turnbull, B. (2013). A multiple comparison procedure for hypotheses with gatekeeping structure. Biometrika, 100:301–317.

Marcus, R., Peritz, E., and Gabriel, K. (1976). On closed testing procedure with special reference to ordered analysis of variance. Biometrika, 63:655–660.

Maurer, W. and Bretz, F. (2013a). Memory and other properties of multiple test procedures generated by entangled graphs. Statistics in Medicine, 32:1739–1753.

Maurer, W. and Bretz, F. (2013b). Multiple testing in group sequential trials using graphical approaches. Statistics in Biopharmaceutical Research, 5(4):311–320.

Maurer, W. and Bretz, F. (2014). A note on testing families of hypotheses using graphical procedures. Statistics in Medicine (to appear).

Maurer, W., Glimm, E., and Bretz, F. (2011). Multiple and repeated testing of primary, co-primary and secondary hypotheses. Statistics in Biopharmaceutical Research, 3:336–352.

Maurer, W., Hothorn, L., and Lehmacher, W. (1995). Multiple comparisons in drug clinical trials and preclinical assays: A-priori ordered hypotheses. In Vollmar, J., ed., Biometrie in der chemisch-pharmazeutischen Industrie. Fischer Verlag, Stuttgart, Germany.

Maurer, W. and Klinglmüller, F. (2013). Sequentially rejective test procedures for partially ordered and algebraically dependent systems of hypotheses. Talk given at the International Conference on Simultaneous Inference, Hannover, Germany.

Maurer, W. and Mellein, B. (1988). On new multiple tests based on independent p-values and the assessment of their power. In Bauer, P., Hommel, G., and Sonnemann, E., eds., Multiple Hypothesenprüfung. Springer Verlag, Berlin, Germany.

Millen, B. and Dmitrienko, A. (2011). A class of flexible closed testing procedures with clinical trial applications. Journal of Biopharmaceutical Statistics, 3:14–30.

O’Neill, R. (1997). Secondary endpoints cannot be validly analyzed if the primary endpoint does not demonstrate clear statistical significance. Controlled Clinical Trials, 18:550–556.

Proschan, M., Lan, K., and Wittes, J. (2006). Statistical Monitoring of Clinical Trials: A Unified Approach. Springer, New York.

Rauch, G. and Beyersmann, J. (2013). Planning and evaluating clinical trials with composite time-to-first-event endpoints in a competing risk framework. Statistics in Medicine, 32:3595–3608.

Rohmeyer, K. and Klinglmueller, F. (2014). gMCP: Graph Based Multiple Test Procedures. R package version 0.8-6. http://cran.r-project.org/web/packages/gMCP/, accessed July 11, 2014.

Senn, S. and Bretz, F. (2007). Power and sample size when multiple endpoints are considered. Pharmaceutical Statistics, 6:161–170.

Simes, R. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73:751–754.

Sozu, T., Sugimoto, T., and Hamasaki, T. (2010). Sample size determination in clinical trials with multiple co-primary binary endpoints. Statistics in Medicine, 29:2169–2179.

Strassburger, K. and Bretz, F. (2008). Compatible simultaneous lower confidence bounds for the Holm procedure and other Bonferroni based closed tests. Statistics in Medicine, 27:4914–4927.

Sugitani, T., Bretz, F., and Maurer, W. (2014). A simple and flexible graphical approach for adaptive group-sequential clinical trials. (submitted for publication).


Sugitani, T., Hamasaki, T., and Hamada, C. (2013). Partition testing in confirmatory adaptive designs with structured objectives. Biometrical Journal, 55:341–359.

Westfall, P. and Krishen, A. (2001). Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures. Journal of Statistical Planning and Inference, 99:25–40.

Westfall, P., Krishen, A., and Young, S. (1998). Using prior information to allocate significance levels for multiple endpoints. Statistics in Medicine, 17:2107–2119.

Westfall, P., Tobias, R., and Wolfinger, R. (2011). Multiple Comparisons and Multiple Tests Using SAS. SAS, Cary, NC.

Westfall, P. and Young, S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley, New York.

Whitehead, J. (1997). The Design and Analysis of Sequential Clinical Trials. Wiley, Chichester.

Wiens, B. and Dmitrienko, A. (2005). The fallback procedure for evaluating a single family of hypotheses. Journal of Biopharmaceutical Statistics, 15:929–942.

Wiens, B., Dmitrienko, A., and Marchenko, O. (2013). Selection of hypothesis weights and ordering when testing multiple hypotheses in clinical trials. Journal of Biopharmaceutical Statistics, 23(6):1403–1419.

Xi, D. and Tamhane, A. (2014a). Allocating recycled significance levels in group sequential procedures for multiple endpoints. (submitted).

Xi, D. and Tamhane, A. (2014b). A general multistage procedure for k-out-of-n gatekeeping. Statistics in Medicine, 33(8):1321–1335.

Xiong, C., Yu, K., Gao, F., Yan, Y., and Zhang, Z. (2005). Power and sample size for clinical trials when efficacy is required in multiple endpoints: application to an Alzheimer’s treatment trial. Clinical Trials, 2:387–393.