Statistical Issues in Contraceptive Trials

Daniel L. Gillen, PhD

Department of Statistics

University of California, Irvine

FDA Reproductive Drugs Advisory Committee Meeting, Jan 23-24


Minimum requirements of a clinical trial

Appropriate target population

Use of appropriate comparison groups

Use of appropriate outcome measure

Ability to maintain statistical criteria for evidence: controlling type I and II errors in the frequentist setting


Outline

Outcome measures: Pearl Index vs. life-table methods

Comparison populations: historical vs. active control trials

Defining statistical evidence: testing for superiority vs. non-inferiority


Outcome Measures: Pearl Index vs. Life Table Methods


The Pearl Index

The Pearl Index (number of pregnancies per 100 woman years) is a common measure used to summarize contraceptive effectiveness

However, a drawback of the Pearl Index is that in most situations it is dependent on time and must be interpreted accordingly

Such dependence occurs because of the changing baseline risk of pregnancy within study samples as time marches forward


Ex: Sensitivity of Pearl Index to duration of follow-up

Suppose our study population consists of two groups

“Low risk” group (90% of population): constant risk of pregnancy, with a 1-year probability of pregnancy of 5%

“High risk” group (10% of population): constant risk of pregnancy, with a 1-year probability of pregnancy of 50%


Ex (cont’d): One-year Pearl Index

Now consider the Pearl Index calculated over the first year

Expected number of pregnancies in a cohort of 5000 women: 5000*(0.90*0.05 + 0.10*0.50) = 475

Expected person-years at risk with censoring for pregnancy (pregnancies assumed to occur mid-year on average): 4525*1 + 475*0.5 = 4762.5

Pearl Index: (475 / 4762.5)*100 = 9.97 pregnancies per 100 per year
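The one-year arithmetic can be reproduced with a short sketch. The function name `pearl_index` and the variable names are illustrative only, and the half-year of exposure credited to each woman who conceives is the same mid-year assumption used in the person-years calculation.

```python
def pearl_index(pregnancies, woman_years):
    """Pregnancies per 100 woman-years of exposure."""
    return 100.0 * pregnancies / woman_years


cohort = 5000
# (share of cohort, 1-year probability of pregnancy) for each risk group
groups = [(0.90, 0.05), (0.10, 0.50)]

expected_pregnancies = cohort * sum(share * prob for share, prob in groups)

# Women who do not conceive contribute a full year of exposure; those who
# conceive are assumed to contribute half a year on average before censoring.
expected_woman_years = (cohort - expected_pregnancies) + 0.5 * expected_pregnancies

one_year_pi = pearl_index(expected_pregnancies, expected_woman_years)
print(round(expected_pregnancies, 1), round(expected_woman_years, 1), round(one_year_pi, 2))
```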


Ex (cont’d): Two-year Pearl Index

For the Pearl Index calculated over 2 years, we need to consider the impact of censoring the “high risk” group at pregnancy

By the end of one year

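Number left in low risk group: 5000*0.90*(1-0.05) = 4275

Number left in high risk group: 5000*0.10*(1-0.50) = 250

Percent of total population in high risk group at one year is 250/4525 = 5.5% (the 5.8% used in the next calculation is 250/4275, the ratio of high risk to low risk women remaining)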



Now consider the Pearl Index calculated between years 1 and 2

Expected number of pregnancies occurring between 1 and 2 years of follow-up: 4525*(0.942*0.05 + 0.058*0.50) = 344.4

Expected person-years at risk between year 1 and year 2: 4180.6*1 + 344.4*0.5 = 4352.8 person-years

Pearl Index calculated between years 1 and 2: (344.4 / 4352.8)*100 = 7.92 pregnancies per 100 per year



Now consider the Pearl Index calculated over 2 years

Expected number of pregnancies observed over 2 years: 475 + 344.4 = 819.4

Expected person-years at risk over 2 years: 4762.5 + 4352.8 = 9115.3 person-years

Pearl Index calculated over 2 years: (819.4 / 9115.3)*100 = 8.99 pregnancies per 100 per year
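The decline of the Pearl Index with longer follow-up can be sketched for any number of years. This is an illustration under the example's assumptions (two groups, constant annual probabilities, mid-year pregnancies), and the function name `pearl_index_by_year` is hypothetical. It tracks each group's survivors directly rather than using the rounded 94.2% / 5.8% split, so its year-2 figures (about 7.78 for years 1 to 2 and about 8.92 over two years) come out slightly below the 7.92 and 8.99 shown above; the qualitative conclusion is the same.

```python
def pearl_index_by_year(cohort, groups, years):
    """Return [(year, interval Pearl Index, cumulative Pearl Index)].

    groups: list of (share of cohort, annual probability of pregnancy).
    """
    at_risk = [cohort * share for share, _ in groups]
    total_pregnancies = 0.0
    total_woman_years = 0.0
    results = []
    for year in range(1, years + 1):
        pregnancies = sum(n * prob for n, (_, prob) in zip(at_risk, groups))
        # Full year for women who do not conceive, half a year for those who do.
        woman_years = (sum(at_risk) - pregnancies) + 0.5 * pregnancies
        total_pregnancies += pregnancies
        total_woman_years += woman_years
        results.append((
            year,
            100.0 * pregnancies / woman_years,
            100.0 * total_pregnancies / total_woman_years,
        ))
        # Censor pregnancies: only women who did not conceive stay at risk.
        at_risk = [n * (1.0 - prob) for n, (_, prob) in zip(at_risk, groups)]
    return results


results = pearl_index_by_year(5000, [(0.90, 0.05), (0.10, 0.50)], years=2)
for year, interval_pi, cumulative_pi in results:
    print(year, round(interval_pi, 2), round(cumulative_pi, 2))
```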


When is the Pearl Index independent of study support?

The Pearl Index will change with the length of follow-up unless:

1. The rate of pregnancies is homogeneous across all possible subgroups

2. This rate remains constant with time


When is the Pearl Index independent of study support?

In the previous example, it should be noted that even if we allow participants with failures to re-enter the risk set the Pearl Index will still depend upon time

This is because a failure results in less at-risk time, thus total years of follow-up will be proportionately less in the “high risk” group as duration of maximal follow-up increases


A further issue in quantifying the Pearl Index…

Most confidence intervals for the Pearl Index assume a Poisson distribution

This distribution is defined as having variance equal to the mean (or rate)

However, count or rate data are typically better characterized as stemming from an overdispersed Poisson distribution

That is, the true variance in the rate that we observe is larger than we assume from the Poisson distribution

Overdispersion in Poisson rates typically arises from heterogeneity of patient populations


Computation of confidence intervals for the Pearl Index

Consider our previous example with a “low risk” and a “high risk” group

Low risk group (90% of population): constant risk of pregnancy, with a 1-year probability of pregnancy of 5%

High risk group (10% of population): constant risk of pregnancy, with a 1-year probability of pregnancy of 50%


Computation of confidence intervals for the Pearl Index

We previously calculated the (true) 1 year Pearl Index to be 9.97 pregnancies per 100 per year

Suppose that in reality, we observed 457 pregnancies over 1 year with a total of 4763 years of follow-up, resulting in a Pearl Index of 9.60 per 100 per year

Assuming a Poisson distribution the corresponding 95% confidence interval for the 1 year Pearl Index would be (8.73, 10.51)


Computation of confidence intervals for the Pearl Index

However, because the Pearl Index is really composed of a mixture of Poisson distributions (from the high and low risk groups) the true variance is actually 19.2% larger than assumed by the usual (single) Poisson model

This means that we have underestimated the variance, i.e., our confidence interval is shorter than it should be!

In this case, a 95% confidence interval accounting for the heterogeneity of groups is (8.63, 10.55).

This is approximately 8% wider than the previous interval
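A rough sketch of where the extra variance comes from and how it widens the interval. Two assumptions are made here that are not stated in the example: the groups' 1-year probabilities are treated as Poisson rates, and the interval is a simple normal approximation. The function names are hypothetical. Under these assumptions the variance-to-mean ratio of the mixture is about 1.192, and the resulting intervals are close to, but not identical to, the ones quoted above, which may have been computed by a different method.

```python
import math


def overdispersion_factor(groups):
    """Variance-to-mean ratio of a mixed Poisson count: 1 + Var(rate) / E(rate).

    groups: iterable of (share, rate) pairs whose shares sum to 1.
    """
    mean_rate = sum(share * rate for share, rate in groups)
    var_rate = sum(share * (rate - mean_rate) ** 2 for share, rate in groups)
    return 1.0 + var_rate / mean_rate


def pearl_index_ci(pregnancies, woman_years, dispersion=1.0, z=1.96):
    """Normal-approximation CI for the Pearl Index (per 100 woman-years)."""
    estimate = 100.0 * pregnancies / woman_years
    # Under a Poisson model the variance of the count equals the count;
    # `dispersion` inflates that variance to allow for heterogeneity.
    half_width = z * 100.0 * math.sqrt(dispersion * pregnancies) / woman_years
    return estimate - half_width, estimate + half_width


phi = overdispersion_factor([(0.90, 0.05), (0.10, 0.50)])
poisson_ci = pearl_index_ci(457, 4763)
adjusted_ci = pearl_index_ci(457, 4763, dispersion=phi)
print(round(phi, 3), poisson_ci, adjusted_ci)
```

Because the half-width scales with the square root of the dispersion factor, a 19.2% increase in variance widens the interval by roughly 9% under this approximation.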


How to deal with the changing composition of the risk set?

We illustrated one way in our example

Consider the probability of failure at specific time points by using conditional probability

For example, if T is the time of failure we can compute the probability of failure within two years as

Pr[T<2] = 1-Pr[T>2] = 1 - Pr[T>2|T>1]Pr[T>1] = 1-(1-0.0792)*(1-0.0997) = 0.171
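The same conditional-probability calculation extends to any number of intervals. A minimal sketch, with the function name `cumulative_failure` chosen here for illustration:

```python
def cumulative_failure(conditional_failure_probs):
    """Life-table estimate of Pr[T < t].

    conditional_failure_probs: probability of failure in each successive
    interval, given that no failure occurred in earlier intervals.
    """
    surviving = 1.0
    for q in conditional_failure_probs:
        surviving *= 1.0 - q
    return 1.0 - surviving


# Year 1 and year 2 conditional failure probabilities from the example.
two_year_failure = cumulative_failure([0.0997, 0.0792])
print(round(two_year_failure, 3))
```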


How to deal with the changing composition of the risk set?

This is called a life-table estimate

In the setting of contraceptive failure, these conditional probabilities are typically computed monthly to more accurately incorporate the risk set (see e.g. Potter, 1966)

When the life-table estimate is evaluated at all (distinct) failure times, this is called a Kaplan-Meier estimate.
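A minimal Kaplan-Meier sketch in plain Python, evaluating the life-table product at each distinct failure time. The data are hypothetical and the function name `kaplan_meier` is illustrative; in practice an established survival analysis package would be used.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct failure time.

    times:  follow-up time for each woman (e.g. months of exposure)
    events: 1 if follow-up ended in pregnancy, 0 if censored
    """
    failure_times = sorted({t for t, e in zip(times, events) if e})
    survival = 1.0
    curve = []
    for t in failure_times:
        # Women still under observation at time t form the risk set.
        at_risk = sum(1 for u in times if u >= t)
        failures = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: months of follow-up and whether it ended in pregnancy.
months = [2, 5, 5, 8, 12, 12, 12, 12]
pregnant = [1, 0, 1, 1, 0, 0, 0, 0]
curve = kaplan_meier(months, pregnant)
for t, s in curve:
    print(t, round(1.0 - s, 3))  # cumulative probability of failure by month t
```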


Are there any benefits to using the Pearl Index?

Clearly, the Pearl Index has been in wide use

The reasons for this are

Ease of interpretation, although the Kaplan-Meier estimator also has a clinically relevant interpretation (probability of failure over T years of use)

For historically controlled trials, there is a great deal of data summarized in terms of the Pearl Index

This will, of course, change as the popularity of Kaplan-Meier estimates grows in the field


Can we incorporate changing treatment regimens?

Patients may discontinue use or use additional contraceptives for some intervals of time

Technically, the Kaplan-Meier estimator could incorporate such left and right censoring.

However, it is not clear when patients should re-enter the risk set


Can we incorporate changing treatment regimens?

For example, consider the case where a participant uses back-up contraception during the interval (t1, t2).

This individual could be considered at risk for the interval (0, t1) then re-entered into the risk set at time t2.

However, by doing this we are implicitly making the assumption that this person’s hazard (or risk of pregnancy) at time t2 is the same as all others who have been at risk from (0, t2)

This is not a reasonable assumption to me and I would advise against it


Can we incorporate changing treatment regimens?

Another option for incorporating changing treatment regimens would come from post-hoc analyses

Stratified Kaplan-Meier estimates: the number of strata could become large

Time-dependent covariates: e.g., consider a proportional hazards framework
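One way to write the time-dependent covariate idea, as a sketch only: let $Z_i(t)$ equal 1 while woman $i$ is using back-up contraception at time $t$ and 0 otherwise, and assume a proportional hazards model for the hazard of pregnancy. The symbols and the binary coding are assumptions introduced here for illustration.

```latex
\lambda_i(t) = \lambda_0(t)\,\exp\{\beta Z_i(t)\}
```

Here $\lambda_0(t)$ is the baseline hazard of pregnancy under the study method alone, and $e^{\beta}$ is the hazard ratio associated with back-up use at time $t$.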


Regardless of the measure, what defines a failure and who is at risk?

For all new interventions we must consider:

Safety: Are there adverse effects that clearly outweigh any potential benefit?

Efficacy: Can the intervention reduce the probability of unintended pregnancy in a beneficial way?

Effectiveness: Would adoption of the intervention as a standard reduce the probability of unintended pregnancy in the population?


One difference between evaluation of efficacy and effectiveness is in what defines a failure and who should be included in the risk set

In a clinical trial setting we can truly only evaluate efficacy because of possible selection bias of patients entering contraceptive trials

However, even in the clinical trial setting it is useful to evaluate

Intervention failure rates during actual use (including inconsistent or incorrect use)

Intervention failure rates during perfect use (see eg. Trussell, Contraception, 2004)



To assess true method efficacy, counting only “method failures” during perfect use, we must include only perfect-use exposure in the risk set

We also need to consider whether those who are lost to follow-up should be considered at risk all the way up to the time of drop-out

One reasonable approach is to censor patients three months prior to the time at which they become lost to follow-up (Trussell, SIM, 1991)
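The three-month rule can be sketched as a small helper that sets each participant's exposure time before the risk set is built. The function and parameter names are hypothetical, and time is assumed to be recorded in months.

```python
def exposure_months(last_contact_month, lost_to_follow_up, lag_months=3):
    """Months of exposure to count in the risk set for one participant.

    A woman lost to follow-up is censored `lag_months` before her last
    contact; otherwise her exposure runs to her last contact.
    """
    if lost_to_follow_up:
        return max(0, last_contact_month - lag_months)
    return last_contact_month


print(exposure_months(10, lost_to_follow_up=True))
print(exposure_months(10, lost_to_follow_up=False))
```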



Historical vs. Active Control Trials


Historical control trials vs. active control trials

In the past many methods have been assessed via a historical control trial

E.g., a criterion such as a Pearl Index of 1.5 (or more recently 2) or less has been used as an efficacy criterion

Such criteria stem from the experience of historical controls

However, biases resulting from historical control studies can be numerous, particularly when study samples are not comparable with respect to baseline risk, evaluative measure of outcome, or duration of study.


Criteria for superiority in historical control trials

As noted, past studies have considered point estimates of the (one year) Pearl Index of less than 1.5 or 2 unintended pregnancies per 100 per year

However, we must also acknowledge uncertainty of these estimates

EMEA requires sufficient sample size to guarantee the width of the 95% CI for the Pearl Index to be no larger than 1

Better (in my opinion) to require that upper bound of CI is less than the chosen threshold

In either case, if the Pearl Index is used the previous notes on computation of the CI need to be considered
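The two criteria can be checked side by side once a confidence interval is in hand. A minimal sketch with a hypothetical interval; the function name and the returned keys are illustrative.

```python
def meets_criteria(ci_lower, ci_upper, threshold, max_width=1.0):
    """Evaluate two possible criteria for a Pearl Index confidence interval."""
    return {
        # Precision criterion: the interval is no wider than `max_width`.
        "width_ok": (ci_upper - ci_lower) <= max_width,
        # Evidence criterion: the whole interval lies below the threshold.
        "upper_below_threshold": ci_upper < threshold,
    }


# Hypothetical 95% CI for a one-year Pearl Index.
result_2 = meets_criteria(ci_lower=0.9, ci_upper=1.8, threshold=2.0)
result_1_5 = meets_criteria(ci_lower=0.9, ci_upper=1.8, threshold=1.5)
print(result_2, result_1_5)
```

The example interval satisfies the width criterion in both cases but rules out only the threshold of 2, which is why the two criteria are not interchangeable.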


Historical control trials vs. active control trials

Because it is impossible to guarantee comparability between historical controls and current study samples, it is almost always advantageous to employ randomization when ethically feasible

Given the wide use of standard contraceptives, it is not feasible to consider a placebo controlled trial

However, one can (and should) consider the use of an active control when comparable interventions are in use

An active control also allows for comparison of the entire survival curve (logrank test or proportional hazards model?)


Superiority vs. Non-Inferiority in Active Control Trials


Superiority vs. non-inferiority in active control trials

Statistical criteria for evidence in a superiority trial

Evidence to rule out equality of effect as measured by the chosen parameter (e.g. Pearl Index, 1-year survival estimate, or a hazard ratio)

Example:

Contrast may be difference in 1-year failure rates as measured by the Kaplan-Meier estimator: $KM_{Tx}(1) - KM_{AC}(1)$

Test: $H_0: KM_{Tx}(1) - KM_{AC}(1) \ge 0$ vs. $H_1: KM_{Tx}(1) - KM_{AC}(1) < 0$

Rejection of null hypothesis corresponds to upper bound of CI for $KM_{Tx}(1) - KM_{AC}(1)$ being less than 0


Superiority vs. non-inferiority in active control trials

Statistical criteria for evidence in a non-inferiority trial

Evidence to rule out some margin of efficacy less than the active control

Example:

Contrast may be difference in 1-year failure rates as measured by the Kaplan-Meier estimator: $KM_{Tx}(1) - KM_{AC}(1)$

Test: $H_0: KM_{Tx}(1) - KM_{AC}(1) \ge \delta$ vs. $H_1: KM_{Tx}(1) - KM_{AC}(1) < \delta$ for some $\delta > 0$

Rejection of null hypothesis corresponds to upper bound of CI for $KM_{Tx}(1) - KM_{AC}(1)$ being less than $\delta$
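Both the superiority and the non-inferiority decision reduce to comparing the upper confidence bound of the difference in 1-year failure probabilities with a margin (zero for superiority). A sketch under stated assumptions: the arms are independent, the standard errors (e.g. from Greenwood's formula) are supplied by the analyst, a normal approximation is used, and all names and numbers are hypothetical.

```python
def active_control_decision(km_tx, se_tx, km_ac, se_ac, margin=0.0, z=1.96):
    """Compare 1-year Kaplan-Meier failure probabilities, treatment vs. active control.

    margin = 0 corresponds to a superiority test; margin > 0 to non-inferiority.
    """
    difference = km_tx - km_ac
    se_difference = (se_tx ** 2 + se_ac ** 2) ** 0.5  # independent arms assumed
    upper_bound = difference + z * se_difference
    return {
        "difference": difference,
        "upper_bound": upper_bound,
        "reject_null": upper_bound < margin,
    }


# Hypothetical estimates: 2.0% vs. 2.5% failure at one year.
superiority = active_control_decision(0.020, 0.004, 0.025, 0.004)
non_inferiority = active_control_decision(0.020, 0.004, 0.025, 0.004, margin=0.01)
print(superiority["reject_null"], non_inferiority["reject_null"])
```

With these illustrative numbers the upper bound is about 0.006, so non-inferiority with a margin of 0.01 would be concluded while superiority would not.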


Superiority vs. non-inferiority in active control trials

When is it reasonable to consider non-inferiority instead of superiority?

ICH E-10 Guidelines: active control treatment must truly be active in the study population

If active control is truly active in the study population:

Can a margin to define non-inferiority be established?

If active control is standard of care, is new treatment also superior on secondary endpoints?



Issues in setting the non-inferiority “margin”?

What measure compares distributions?

Is the treatment effect random?

How much of a decrease in effect is acceptable?

How to account for variability in the estimate(s) from historical trials?



Precedence for setting the non-inferiority “margin”

Is the treatment effect random?

Ideally use meta-analysis of multiple trials

Careful! Do trials have same duration of follow-up?

How much of a decrease in effect is acceptable?

10%, 20%, 50% of active control effect?

How to account for variability in the estimate(s) from historical trials?

Use worst case from historical 95% CI?

Explicitly account for variability in historical trial
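One possible way to combine the "worst case" and "acceptable decrease" ideas is a fixed-margin calculation: take the conservative confidence bound for the active control's effect and allow only a chosen fraction of it to be lost. This is a sketch of that one option, not a recommendation; the function name, the scale of the effect, and the numbers are all hypothetical.

```python
def fixed_margin(effect_ci_lower, acceptable_loss):
    """Non-inferiority margin as a fraction of the worst-case active-control effect.

    effect_ci_lower: lower 95% confidence bound for the active control's effect
                     (reduction in 1-year failure probability) in historical trials
    acceptable_loss: fraction of that effect one is willing to give up
    """
    if not 0.0 < acceptable_loss < 1.0:
        raise ValueError("acceptable_loss must be strictly between 0 and 1")
    if effect_ci_lower <= 0.0:
        raise ValueError("historical data do not establish an active-control effect")
    return acceptable_loss * effect_ci_lower


# Hypothetical worst-case effect: a 0.06 reduction in 1-year failure probability.
margins = {loss: fixed_margin(0.06, loss) for loss in (0.10, 0.20, 0.50)}
print(margins)
```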


Summary


Need to define appropriate target population, comparison group, outcome measure, and maintain statistical criteria for evidence

Pearl Index is (usually) implicitly dependent on the length of follow-up, whereas Kaplan-Meier (life table) estimates make this dependence explicit

In either case, we need to obtain correct inference (CIs) and the definition of the risk set must correspond to the definition of failure

When ethically and logistically possible, active controls should be used

If historical controls are used, uncertainty should be accounted for in defining superiority criteria