
THE PERFORMANCE OF INVERSE PROBABILITY OF TREATMENT WEIGHTING AND PROPENSITY SCORE MATCHING FOR ESTIMATING MARGINAL HAZARD RATIOS

By Jonatan Nåtman

Department of Statistics

Uppsala University

Supervisor: Harry Khamis

2019

ABSTRACT

Propensity score methods are increasingly being used to reduce the effect of measured confounders in observational research. In medicine, censored time-to-event data are common. Using Monte Carlo simulations, this thesis evaluates the performance of nearest neighbour matching (NNM) and inverse probability of treatment weighting (IPTW) in combination with Cox proportional hazards models for estimating marginal hazard ratios. The focus is on the performance for different sample sizes and censoring rates, aspects which have not been fully investigated in this context before. The results show that, in the absence of censoring, both methods can reduce bias substantially. IPTW consistently performed better than NNM in terms of bias and MSE. For the smallest examined sample size of 60 subjects, the use of IPTW led to estimates with bias below 15 %. Since the data were generated using a conditional parametrisation, the estimation of univariate models violates the proportional hazards assumption. As a result, censoring the data led to an increase in bias.

Keywords: Monte Carlo simulations, propensity score, survival analysis, Cox model, censoring rate, sample size

Contents

1 Introduction
2 Background
  2.1 Causal Inference and the Potential Outcome Framework
  2.2 Propensity Score Methods
    2.2.1 Matching on the Propensity Score
    2.2.2 Inverse Probability of Treatment Weighting
  2.3 Survival Analysis and the Cox Proportional Hazards model
  2.4 Previous Research
3 Methodology
  3.1 Data Generating Processes
    3.1.1 Scenario A
    3.1.2 Scenario B
    3.1.3 Scenario C
    3.1.4 Censoring Times
    3.1.5 Conditional Treatment Effect
  3.2 Simulation Design
  3.3 Estimation
4 Results
5 Discussion


1 Introduction

Randomised controlled trials (RCTs) are generally seen as the gold standard in medical research. If treatment assignment is random, the treatment status should not be confounded with either measured or unmeasured baseline characteristics. In some situations, however, randomised controlled studies cannot be conducted for ethical reasons. Research based on observational data also has several advantages compared to randomised studies. The study subjects have not undergone a strict selection procedure and might be more representative of reality. The increasing amount of register data also makes it possible to conduct medical studies at a lower cost, particularly when studying rare events or if long follow-up times are needed. The main disadvantage of observational studies, where the treatment assignment has not been randomised, is the risk of confounding, which can make causal inference difficult.

Propensity score methods are increasingly being used to minimise the effects of confounding when using observational data to estimate the effect of different treatments or exposures. Censored time-to-event data are common in medical research. The effect of a treatment on survival is often described with both a relative and an absolute measure in the biomedical literature, where hazard ratios and survival curves are the most commonly used measures. In recent years, the performance of different propensity score methods for estimating hazard ratios and survival curves has been examined in several simulation studies. These studies have focused on relatively large samples of 1000 or 10 000 subjects. Propensity score methods are typically used in register-based research where large amounts of data are available. However, these methods are also often used for substantially smaller samples, for example when studying patient groups with rare diseases. In addition, the effect of censoring has not been fully investigated in previous studies.

The aim of this thesis is to examine the performance of the most commonly used propensity score methods in combination with the Cox proportional hazards model to estimate marginal hazard ratios. The main research question is "What is the behaviour of bias and MSE of 1:1 greedy nearest neighbour matching, with and without calipers, and inverse probability of treatment weighting, when estimating marginal hazard ratios?" We will focus on the performance for different sample sizes and different censoring rates. When estimating the propensity scores, four different models will be considered: i) a model including only the true confounder(s), ii) a model including all variables related to outcome, iii) a model including all variables related to treatment selection, and iv) a model including all variables related to either outcome or treatment selection. As a second question we will examine which of these models results in estimates of marginal hazard ratios with the lowest bias and MSE.

The outline of the thesis is as follows. Section 2 provides an introduction to the theoretical framework of this thesis. First, the potential outcome framework is presented, followed by a description of the propensity score methods. In the second part of Section 2, the Cox proportional hazards model is presented. Finally, a brief review of the previous research on propensity score methods to estimate marginal hazard ratios is given. Section 3 describes the design of the simulation study, followed by a presentation of the results in Section 4. In the final part, Section 5, the results are summarised and discussed.

2 Background

2.1 Causal Inference and the Potential Outcome Framework

The Rubin causal model and the concept of potential outcomes were introduced by Rubin [1]. Let Zi denote the treatment status of subject i (Zi = 0 if the subject was assigned to the control group and Zi = 1 if the subject received the treatment). This framework then assumes that each subject has two potential outcomes. The first potential outcome, Yi(0), is the outcome that would have been observed if the subject did not receive the treatment. The second, Yi(1), is the outcome that would have been observed if the subject received the treatment. For each subject, the individual treatment effect could then be defined as the difference between the two potential outcomes: πi = Yi(0) − Yi(1). However, only one of the potential outcomes can be observed.

The average treatment effect (ATE) is the average effect of moving an entire population from untreated to treated, expressed as E[Yi(0) − Yi(1)]. A different measure of the effect is the average treatment effect for the treated (ATT), E[Yi(0) − Yi(1) | Zi = 1]. This measure describes the average treatment effect in the population that was ultimately treated. In a randomised study, the two measures will be equal. In an observational study, however, there is no reason to expect the ATE and ATT to be equal, and which of the two measures is of greater interest depends on the study question.

With time-to-event data, the average treatment effect would be the mean difference in survival times that results from the treatment. However, medical researchers are often more interested in measures such as the difference in the probability of observing an event within a specified follow-up time or in the relative effect described by hazard ratios. In this thesis, which focuses on estimating hazard ratios, we will use Austin’s [2] modified definitions of ATE and ATT. In the following sections, ATE will refer to the hazard ratio that would have been obtained by regressing the hazard function on a treatment indicator in a dataset consisting of both potential outcomes of all subjects. Similarly, ATT will be used to refer to the same analysis but restricted to the population of subjects that actually received the treatment.

2.2 Propensity Score Methods

The propensity score is the probability of treatment assignment conditional on observed baseline characteristics, ei = Pr(Zi = 1 | Xi). Rosenbaum and Rubin [3] showed that adjustment for the propensity score is sufficient to remove bias due to measured confounders. In an observational study the propensity score is generally unknown and needs to be estimated. Several methods can be used, but the most common is estimation using logistic regression.
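As a rough illustration of this estimation step, the sketch below fits a logistic regression with an intercept and a single covariate by Newton-Raphson and returns the fitted propensity scores. It is written in plain Python purely for illustration; the function name `fit_propensity`, the simulated data and the coefficient values are assumptions made for the example, and in practice a statistical package would be used instead.

```python
import math
import random


def fit_propensity(x, z, n_iter=25):
    """Fit logit Pr(Z = 1 | x) = a + b * x by Newton-Raphson.

    Returns the intercept, the slope and the fitted propensity scores.
    """
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        # Gradient (g0, g1) and information matrix (h00, h01, h11) of the log-likelihood
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, zi in zip(x, z):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += zi - p
            g1 += (zi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * step = gradient
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    scores = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
    return a, b, scores


# Illustrative data (assumed): one standard normal covariate, intercept -1, slope log(2)
random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(5000)]
z = [1 if random.random() < 1.0 / (1.0 + math.exp(-(-1.0 + math.log(2) * xi))) else 0
     for xi in x]
a_hat, b_hat, e_hat = fit_propensity(x, z)
```

With several covariates the same idea applies, but the 2x2 solve would be replaced by a general linear solver.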

There is a discussion on how to select the propensity score model. The advice by Rubin and Thomas [4] is to include all variables that are related to the outcome, regardless of whether they are related to exposure. Variables that are related only to exposure should not be included. This is also the conclusion from two simulation studies [5, 6]. This strategy for variable selection might seem counter-intuitive but has been motivated by the notion that even if a variable is theoretically unrelated to the exposure, there is a possibility that it is related to the exposure by chance in the dataset. If that is the case, the variable will be a confounder in that particular dataset. On the other hand, including a variable that is strongly related to treatment selection but unrelated to the outcome will add variability to the propensity score model that does not correct for confounding [5].

When the propensity score has been estimated, there are four broad methods to adjust for it: matching on the propensity score, stratification on the propensity score, covariate adjustment using the propensity score and inverse probability of treatment weighting (IPTW) [3, 7]. All of these methods have been applied to survival data and have recently been evaluated in simulation studies. It has been shown that matching and IPTW estimate marginal effects whereas stratification and covariate adjustment estimate conditional effects [8, 9]. Therefore, this thesis will focus on propensity score matching and IPTW.


2.2.1 Matching on the Propensity Score

Matching on the propensity score means that a matched set is formed of treated and untreated subjects that have similar values of the propensity score. There are several ways to implement propensity score matching and Austin [2] gives a good overview of different algorithms. The most common implementation is to form pairs of treated and control subjects, which is called one-to-one (1:1) matching. Another approach is one-to-many (1:M) matching, where one treated subject is matched to two or more control subjects. Full matching is another method that makes use of all individuals in the data by forming matched sets of one treated subject and one or more controls. Full matching can estimate both ATE and ATT whereas 1:1 matching always targets ATT [2]. We will focus only on 1:1 matching, which is the most commonly used method for propensity score matching.

A matching algorithm also needs to be chosen. The two most common are optimal matching and greedy nearest neighbour matching (NNM) [2]. The first aims to minimise the average difference in propensity score within the matched pairs. The latter, on the other hand, selects one treated subject at a time and matches it to the untreated subject with the closest propensity score. The greedy matching algorithm chooses the order in which the treated subjects are matched either randomly, sequentially from lowest to highest propensity score or from highest to lowest, or by the order of how close the best match is.

It is also common to put restrictions on the quality of the matches. This is called nearest neighbour matching with calipers. Caliper matching is similar to NNM, but if there is no untreated subject with a propensity score within a specified distance from the selected treated subject, the treated subject will not be included in the matched sample and the analysis. Austin [10] recommends using a caliper of 0.2 standard deviations of the logit of the propensity score. The choice of using calipers or not represents a trade-off between two sources of bias. If no restrictions are put on the quality of the matches, or if the calipers are wide, the distribution of propensity scores between the treated group and the control group might still differ substantially after matching. If calipers are used, some treated subjects might be excluded from the analysis, which changes the target population.
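The greedy algorithm with an optional caliper can be sketched in a few lines. The version below is a minimal illustration in Python, not the implementation used in the simulations: it matches without replacement in random order, measures distance on the logit of the propensity score, and computes the caliper from the standard deviation of the logit over the whole sample. These choices, and the function name `greedy_match`, are assumptions made for the example.

```python
import math
import random


def greedy_match(ps, treated, caliper_sd=None, seed=0):
    """1:1 greedy nearest neighbour matching without replacement.

    ps: propensity scores strictly between 0 and 1; treated: 0/1 indicators.
    caliper_sd: caliper width in standard deviations of the logit (None = no caliper).
    Returns a list of (treated_index, control_index) pairs.
    """
    logit = [math.log(p / (1.0 - p)) for p in ps]
    caliper = None
    if caliper_sd is not None:
        mean = sum(logit) / len(logit)
        sd = math.sqrt(sum((v - mean) ** 2 for v in logit) / (len(logit) - 1))
        caliper = caliper_sd * sd
    treated_idx = [i for i, t in enumerate(treated) if t == 1]
    controls = {i for i, t in enumerate(treated) if t == 0}
    random.Random(seed).shuffle(treated_idx)  # random matching order
    pairs = []
    for i in treated_idx:
        if not controls:
            break  # no controls left to match
        j = min(controls, key=lambda c: abs(logit[c] - logit[i]))
        if caliper is not None and abs(logit[j] - logit[i]) > caliper:
            continue  # no acceptable control: this treated subject is dropped
        pairs.append((i, j))
        controls.remove(j)  # each control is used at most once
    return pairs


# Made-up propensity scores for two treated and four control subjects
ps = [0.80, 0.75, 0.30, 0.78, 0.28, 0.10]
treated = [1, 0, 1, 0, 0, 0]
pairs_nnm = greedy_match(ps, treated)                       # no caliper
pairs_caliper = greedy_match(ps, treated, caliper_sd=0.2)   # caliper of 0.2 SD
```

Narrowing the caliper makes the trade-off visible: with a sufficiently small `caliper_sd` the treated subjects in this toy example no longer find an acceptable control and drop out of the matched sample.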

Austin [11] compared several different 1:1 matching algorithms in a simulation study where differences in means were estimated. It was found that optimal matching had no advantage over greedy NNM in achieving balance in the baseline covariates. NNM with calipers resulted in estimates with less bias but slightly higher variability than NNM without restriction. For greedy NNM, the order in which the treated subjects were selected did not have any particular effect on estimation. Also, matching with replacement did not have better performance than matching without replacement. Based on this, Austin recommends using NNM with calipers and random order.

Two different methods for matching will be evaluated in this study: 1:1 greedy nearest

neighbour matching, with and without calipers.

2.2.2 Inverse Probability of Treatment Weighting

Inverse probability of treatment weighting (IPTW) uses the propensity score to compute weights. These weights are used to construct a synthetic sample where the distribution of measured covariates is independent of treatment assignment. The weights can be chosen so they represent different target populations, either to target the ATE or the ATT. We will focus on the ATT weights since this is the measure that the 1:1 matching targets. Let Zi be the treatment indicator and ei be the estimated propensity score for the ith subject. The ATT weights are then given by

$$w_i = Z_i + \frac{e_i (1 - Z_i)}{1 - e_i}. \tag{1}$$

Thus, all treated subjects are given a weight of 1 and the subjects in the control group are

given a weight of ei/(1− ei).
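As a minimal sketch of the weighting step, assuming the propensity scores have already been estimated, the ATT weights in equation (1) could be computed as follows. The function name `att_weights` and the example values are illustrative only.

```python
def att_weights(z, e):
    """ATT weights from equation (1): 1 for treated subjects, e / (1 - e) for controls."""
    return [zi + ei * (1 - zi) / (1 - ei) for zi, ei in zip(z, e)]


# Made-up treatment indicators and propensity scores
z = [1, 0, 0, 1]
e = [0.6, 0.5, 0.2, 0.3]
w = att_weights(z, e)
```

A control subject with a propensity score of 0.5 keeps a weight of 1, whereas a control with a score of 0.2 is down-weighted to 0.25, so controls that resemble the treated subjects count for more.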

2.3 Survival Analysis and the Cox Proportional Hazards model

Survival analysis, also called time-to-event analysis, is a collection of statistical procedures for analysing data where the variable of interest is the time until an event occurs. The event of interest can be, for example, death or relapse from remission. A common feature of these data is that observations are censored. Censoring is a type of missing data problem that can occur for several reasons. In medicine, a frequent type is right censoring, which means that all that is known is that the subject is still alive at a given time. Right censoring can, for example, occur when the study is terminated before all subjects have experienced the event or if a subject leaves the study before having experienced the event. We will then only know that the subject was still alive when it was censored, but not the time at which the event occurred.

When analysing the effect of a treatment, usually both an absolute and a relative measure of the effect are of interest. The techniques that are most commonly used to deal with censored time-to-event data are the Kaplan-Meier estimator, which estimates the survival function, and the Cox proportional hazards (PH) model, which is used to model the effect of covariates on the hazard rate. The hazard function is defined as

$$h(t) = \lim_{\Delta t \to 0} \frac{P[t \le T < t + \Delta t \mid T \ge t]}{\Delta t} \tag{2}$$

which is the instantaneous event rate at time t for a subject that has survived until time t. For a small ∆t, h(t)∆t approximates the probability that such a subject will not survive for an additional time ∆t.

In the Cox PH model [12], a one-unit increase in a covariate has a multiplicative effect on the hazard rate

$$h(t; X) = h_0(t) e^{\theta Z + X\beta} \tag{3}$$

where h(t;X) is the hazard function at time t, h0(t) is the baseline hazard, θ is the treatment effect, Z is the treatment indicator, X is a vector of covariates and β is a vector of parameters. The Cox PH model is semi-parametric since the functional form of the baseline hazard does not need to be specified.
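To make the interpretation of θ explicit, the ratio of the hazards of a treated and an untreated subject with the same covariate values X can be written out directly from equation (3). This is only a restatement of the model, with the treatment indicator shown as an explicit argument of the hazard function:

```latex
\frac{h(t; X, Z = 1)}{h(t; X, Z = 0)}
  = \frac{h_0(t)\, e^{\theta + X\beta}}{h_0(t)\, e^{X\beta}}
  = e^{\theta}
```

The baseline hazard and the covariate term cancel, so exp(θ) is the hazard ratio for treatment at fixed X and does not depend on t.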

In the Cox model, θ is the treatment effect conditional on the covariates X. A measure of an effect is said to be collapsible if the marginal and the conditional effect coincide in the absence of confounding [13]. That is true for linear models but not for hazard ratios or odds ratios in general [14]. Regardless of whether X is related to Z, controlling for X therefore changes the estimand. While conditional effects denote an average effect at the individual level, marginal effects denote an effect at the population level. Thus, the marginal effect is the hazard ratio between two populations that are identical, except that in one population all subjects received treatment and in the other everyone was untreated. In that sense, adjusting for covariates at the design stage by propensity score methods and controlling for the covariates in the regression analysis will estimate different effects.

A key assumption of the Cox PH model is the proportional hazards assumption, meaning that the effect of a covariate is constant over time. Chastang et al. [15] showed that omission of a covariate in a Cox PH model can cause bias, even if the covariate is completely balanced between the treatment and the control groups. This results from the hazards in the two groups no longer being proportional if a prognostic variable is omitted from the model. Even when the PH assumption is violated, the estimate from a Cox model can still be useful. It can be seen as a geometric average of the treatment effect over the support of the data. However, this estimate will be a function of the distribution of censoring times [16].

2.4 Previous Research

Gayat et al. [8] evaluated the performance of stratification on the propensity score, covariate adjustment using the propensity score and propensity score matching. They concluded that the methods, except matching, estimated conditional effects rather than marginal effects. Matching on the propensity score, however, gave unbiased estimates of the marginal hazard ratio. The methods were evaluated with sample sizes of 1000 subjects and a censoring rate of 40 %. They also investigated the effect of an unmeasured confounder. The unmeasured confounder led to substantially biased estimates, but replacing the unmeasured confounder with a highly correlated variable could remove most of the bias.

Austin [9] compared the performance of IPTW and propensity score matching for estimating marginal hazard ratios in a simulation study of several scenarios with different hazard ratios and prevalences of exposure. It was found that both methods yield approximately unbiased estimates of the effect in the treated population. IPTW had lower mean squared errors and the difference between the methods increased when the prevalence of exposure was low. The methods were evaluated in samples of 10 000 subjects and the data were not censored.

Pirracchio et al. [6] examined the performance of IPTW and propensity score matching for estimation of odds ratios in small samples. They found that both methods could yield estimates with less than 10 % bias for sample sizes from 1000 down to 40 subjects. IPTW performed better in terms of bias and MSE except when the sample size was 40. For the smallest sample size, matching yielded estimates with slightly lower bias.

3 Methodology

A Monte Carlo simulation study was performed to evaluate the behaviour of bias and MSE when using propensity score methods to estimate marginal hazard ratios. To examine the performance of the methods under different conditions, we used three slightly modified data generating processes to represent different situations. These simulated situations will be referred to as Scenario A, B and C. In each scenario, samples of different sizes were generated. Also, censoring times were simulated from different distributions. In each simulated data set, the propensity score methods were applied separately and marginal Cox PH models were fitted. The estimates were compared to the true treatment effect. A detailed description of the simulation design and the estimation methods follows.

3.1 Data Generating Processes

3.1.1 Scenario A

Three baseline covariates, X1, X2 and X3, were simulated from independent standard normal distributions. Of these, the first two affected treatment selection whereas the last two affected the outcome. Thus, X2 is the only true confounder.

Each subject’s probability of being assigned to treatment was simulated using a logistic model

$$\operatorname{logit}(p_i) = \log(\alpha_0) + \log(\alpha_1) x_{1i} + \log(\alpha_2) x_{2i} \tag{4}$$

where α1 = α2 = 2 and α0 was chosen so that 30% of the subjects were treated. The treatment status for each subject was then generated from a Bernoulli distribution with individual parameter pi.

The event times were generated following the technique described by Bender et al. [17]. First, a linear predictor was defined as

$$LP_i = \theta Z_i + \log(\beta_2) x_{2i} + \log(\beta_3) x_{3i} \tag{5}$$

where β2 and β3 were also set to 2. These parameters were chosen so that the variables would have a substantial impact on treatment selection and/or outcome.

The time-to-event, T, was simulated by inverting the cumulative hazard function of a Weibull distribution,

$$T_i = \left( \frac{-\log(u_i)}{\lambda \exp(LP_i)} \right)^{1/\eta} \tag{6}$$

where ui ∼ U(0, 1). The scale parameter λ and the shape parameter η were set to 2 and 0.00002 respectively, as has been done in other studies [9, 18].
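The data generating process of Scenario A, equations (4) to (6), can be sketched as below. This is an illustrative Python version rather than the simulation code itself; the function name `simulate_scenario_a` is hypothetical, and the values passed for `alpha0`, `lam` and `eta` in the usage lines are placeholders chosen for illustration only. The value of `theta` corresponds to the conditional hazard ratio of 1.859 used for Scenario A.

```python
import math
import random


def simulate_scenario_a(n, theta, alpha0, lam, eta, rng):
    """Return a list of (x1, x2, x3, z, t) tuples following equations (4)-(6)."""
    data = []
    for _ in range(n):
        x1, x2, x3 = (rng.gauss(0.0, 1.0) for _ in range(3))
        # Equation (4): treatment model with alpha1 = alpha2 = 2
        lin = math.log(alpha0) + math.log(2) * x1 + math.log(2) * x2
        p = 1.0 / (1.0 + math.exp(-lin))
        z = 1 if rng.random() < p else 0
        # Equation (5): linear predictor with beta2 = beta3 = 2
        lp = theta * z + math.log(2) * x2 + math.log(2) * x3
        # Equation (6): invert the Weibull cumulative hazard
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        t = (-math.log(u) / (lam * math.exp(lp))) ** (1.0 / eta)
        data.append((x1, x2, x3, z, t))
    return data


rng = random.Random(2019)
sample = simulate_scenario_a(n=1000, theta=math.log(1.859), alpha0=0.3,
                             lam=0.00002, eta=2.0, rng=rng)  # placeholder parameter values
prevalence = sum(row[3] for row in sample) / len(sample)
```

In a full simulation, `alpha0` would be tuned until `prevalence` reaches the target share of treated subjects.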


3.1.2 Scenario B

In Scenario B, a second confounder was added. Four variables, X1, ..., X4, were simulated from independent standard normal distributions. Now, X1, ..., X3 affected treatment selection and X2, ..., X4 affected the outcome. Thus, X2 and X3 are confounders.

The probability of assignment was simulated as in Scenario A, but adding the third variable to the data generating process

$$\operatorname{logit}(p_i) = \log(\alpha_0) + \log(\alpha_1) x_{1i} + \log(\alpha_2) x_{2i} + \log(\alpha_3) x_{3i}. \tag{7}$$

Also in this scenario, α0 was chosen so that 30% of the subjects received treatment. Event times were generated as in Scenario A, with the variable X4 added to the linear predictor

$$LP_i = \theta Z_i + \log(\beta_2) x_{2i} + \log(\beta_3) x_{3i} + \log(\beta_4) x_{4i}. \tag{8}$$

Again, the parameters α1, ..., α3 and β2, ..., β4 were set to 2.

3.1.3 Scenario C

Scenario C is also a slightly modified version of Scenario A. However, α0 was changed so that

10% of the subjects received treatment. In all other aspects, the data generating processes were

the same as in Scenario A with three simulated variables of which one was a true confounder.

3.1.4 Censoring Times

To simulate censoring times Ci, two different distributions were used. First, censoring was

simulated from the Weibull distribution. The shape parameter η was set to 0.00002 and the

scale parameter λ was changed to obtain the desired censoring rate. A uniform distribution was

also considered, where Ci ∼ unif(0, b), changing the parameter b to obtain different rates.
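Once event and censoring times are available, the observed data follow from the usual right-censoring construction: the observed time is the smaller of the two and an indicator records whether the event was seen. The sketch below illustrates this for uniform censoring; the exponential event times and the value b = 4 are placeholders assumed for the example, not values from the study.

```python
import random


def apply_censoring(event_times, censor_times):
    """Return (observed_time, event_indicator) pairs under right censoring."""
    observed = []
    for t, c in zip(event_times, censor_times):
        if t <= c:
            observed.append((t, 1))  # event observed before censoring
        else:
            observed.append((c, 0))  # subject censored at time c
    return observed


rng = random.Random(7)
event_times = [rng.expovariate(1.0) for _ in range(2000)]    # placeholder event times
censor_times = [rng.uniform(0.0, 4.0) for _ in range(2000)]  # C_i ~ unif(0, b) with b = 4 assumed
obs = apply_censoring(event_times, censor_times)
censoring_rate = 1 - sum(d for _, d in obs) / len(obs)
```

Lowering b shortens the censoring times and raises `censoring_rate`, which is how a target rate can be reached by trial and error.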

3.1.5 Conditional Treatment Effect

Since the survival times were generated from a conditional model, exp(θ) is the conditional treatment effect. An iterative bisection method was used to determine the conditional effect that induced the desired marginal hazard ratio. This method was described in detail by Austin for generating data with a specified marginal odds ratio and has also been applied to hazard ratios [9, 19].

Table 1: Summary of simulation settings

Scenario  Data generating process                               Number of    Sample         Prevalence
                                                                confounders  sizes          of treatment
A         logit(pi) = α0 + log(2)x1i + log(2)x2i                1            60,...,1000    30 %
          LPi = θZi + log(2)x2i + log(2)x3i
B         logit(pi) = α0 + log(2)x1i + log(2)x2i + log(2)x3i    2            60,...,1000    30 %
          LPi = θZi + log(2)x2i + log(2)x3i + log(2)x4i
C         logit(pi) = α0 + log(2)x1i + log(2)x2i                1            180,...,3000   10 %
          LPi = θZi + log(2)x2i + log(2)x3i

Briefly, a data set of 1000 subjects was simulated. For each treated subject, the two potential outcomes were generated as described above. The survival times were not censored, since it is most often the true uncensored effect that is of scientific interest. Survival times were then regressed on the treatment indicator using Cox regression. Over 10 000 replications, the mean of the parameter estimates was taken as the true marginal hazard ratio in the treated population corresponding to the conditional effect θ. We then searched for the value of θ that gave the desired marginal hazard ratio.
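The search itself is an ordinary bisection over θ. The sketch below shows only that outer loop; the callable `marginal_log_hr` stands in for the Monte Carlo step described above (simulate both potential outcomes, fit the Cox model, average the estimates) and is replaced here by a simple increasing toy function. Both the function names and the toy mapping are hypothetical and are not the mapping obtained in the thesis.

```python
import math


def bisect_theta(marginal_log_hr, target, lo, hi, tol=1e-6, max_iter=100):
    """Find theta in [lo, hi] such that marginal_log_hr(theta) is close to target.

    Assumes marginal_log_hr is increasing in theta and that [lo, hi] brackets the target.
    """
    if marginal_log_hr(lo) > target or marginal_log_hr(hi) < target:
        raise ValueError("target is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if marginal_log_hr(mid) < target:
            lo = mid  # marginal effect too small: increase theta
        else:
            hi = mid  # marginal effect too large: decrease theta
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def toy_marginal_log_hr(theta):
    """Hypothetical stand-in for the simulated marginal log-hazard ratio."""
    return 0.65 * theta


theta_star = bisect_theta(toy_marginal_log_hr, target=math.log(1.5), lo=0.0, hi=2.0)
```

Because each evaluation in the real procedure is a Monte Carlo average, the tolerance would in practice be limited by simulation noise rather than by `tol`.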

3.2 Simulation Design

The true marginal hazard ratio was chosen to be 1.5. This was obtained by setting the conditional hazard ratio exp(θ) ≈ 1.859 in Scenario A, exp(θ) ≈ 1.983 in Scenario B and exp(θ) ≈ 1.864 in Scenario C. Seven different censoring distributions were simulated: the situation of no censoring, and censoring rates of 20%, 40% and 50% from uniform and Weibull distributions, respectively. In each combination of censoring distribution and Scenario A-C, samples of different sizes were generated. In Scenario A and B, the samples ranged from 1000 down to 60 subjects (n = 1000, 900, 800, 700, 600, 500, 400, 300, 200, 180, 160, 140, 120, 100, 80, 60). In order to get a reasonable number of observations in the treated group in Scenario C, we multiplied the total sample sizes by three (n ranging from 3000 down to 180). Thus, the expected number of treated subjects was the same as in the other two scenarios. A summary of the three scenarios is presented in Table 1.


3.3 Estimation

Within each simulated data set, the propensity scores were estimated using logistic regression. Four different propensity score models were used: a model including only the true confounder(s) (M1), a model including all variables related to the outcome (M2), a model including all variables related to treatment selection (M3) and finally a model including all variables related to either outcome or treatment (M4).

Three different propensity score methods were applied: i) 1:1 greedy nearest neighbour matching (NNM), ii) 1:1 greedy nearest neighbour matching within a caliper of 0.2 standard deviations of the logit of the propensity score (Caliper matching) and iii) inverse probability of treatment weighting (IPTW). In the matched samples, a univariate Cox PH model was used to regress time-to-event on the treatment indicator. For IPTW, a Cox PH model was also used to regress time-to-event on the treatment indicator, including the ATT weights as sample weights. For comparison, an unadjusted model was also estimated in the original sample (crude model), as well as a correctly specified conditional model (adjusted model). Since we suspect that the marginal hazards might violate the PH assumption, we will check the proportionality by also estimating hazard ratios for different time intervals.
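To show what fitting a univariate Cox model with sample weights involves, the sketch below maximises the weighted partial likelihood for a single binary treatment indicator by Newton-Raphson. It is a simplified illustration under stated assumptions (no tied event times, no variance estimation) and is not how any statistical package implements the model; the function name `cox_log_hr` and the simulated check data are hypothetical.

```python
import math
import random


def cox_log_hr(times, events, z, weights=None, n_iter=30):
    """Log-hazard ratio for a binary covariate z from a (weighted) Cox partial likelihood.

    Newton-Raphson on the partial likelihood, assuming no tied event times.
    """
    n = len(times)
    if weights is None:
        weights = [1.0] * n
    # Process subjects from the latest to the earliest time so that the running
    # sums s0 and s1 always describe the risk set of the current subject.
    order = sorted(range(n), key=lambda i: times[i], reverse=True)
    theta = 0.0
    for _ in range(n_iter):
        s0 = s1 = 0.0
        score = info = 0.0
        for i in order:
            r = weights[i] * math.exp(theta * z[i])
            s0 += r
            s1 += r * z[i]
            if events[i] == 1:
                mean_z = s1 / s0  # weighted share of treated in the risk set
                score += weights[i] * (z[i] - mean_z)
                info += weights[i] * mean_z * (1.0 - mean_z)  # z is binary, so z^2 = z
        step = score / info
        theta += step
        if abs(step) < 1e-10:
            break
    return theta


# Check on simulated data with a known hazard ratio of 2 and no censoring
rng = random.Random(3)
z = [1 if rng.random() < 0.5 else 0 for _ in range(4000)]
times = [rng.expovariate(2.0 if zi else 1.0) for zi in z]
events = [1] * len(z)
theta_hat = cox_log_hr(times, events, z)
```

For IPTW the ATT weights would be passed as `weights`; for the matched analyses the function would be applied to the matched subjects only, with unit weights.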

The methods were evaluated with relative bias and mean squared error (MSE) on the log-hazard scale

$$\text{Bias} = \frac{1}{5000} \sum_{i=1}^{5000} \frac{\hat{\delta}_i - \delta}{\delta} \tag{9}$$

$$\text{MSE} = \frac{1}{5000} \sum_{i=1}^{5000} (\hat{\delta}_i - \delta)^2 \tag{10}$$

where δ̂i is the estimated log-hazard ratio of the ith simulated dataset and δ is the true marginal log-hazard ratio in the treated population. When evaluating the conditional model, δ was defined as the true conditional effect.
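The two performance measures in equations (9) and (10) translate directly into code. In the sketch below the function names and the example estimates are made up for illustration; the number of replications is taken from the length of the list rather than fixed at 5000.

```python
import math


def relative_bias(estimates, truth):
    """Equation (9): mean of (estimate - truth) / truth over the replications."""
    return sum((d - truth) / truth for d in estimates) / len(estimates)


def mse(estimates, truth):
    """Equation (10): mean squared deviation of the estimates from the truth."""
    return sum((d - truth) ** 2 for d in estimates) / len(estimates)


delta = math.log(1.5)                  # true marginal log-hazard ratio
estimates = [0.38, 0.42, 0.45, 0.40]   # made-up log-hazard ratio estimates
bias_hat = relative_bias(estimates, delta)
mse_hat = mse(estimates, delta)
```

Because the bias is expressed relative to δ, a value of 0.15 corresponds to the 15 % threshold quoted in the abstract.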

All simulations were performed in R version 3.5.1. The function glm was used to fit logistic regressions to estimate the propensity scores. The function Match in the package Matching was used for matching on the propensity score. To estimate Cox proportional hazards models, coxph in the survival package was used.


4 Results

For each propensity score method, four different logistic models were considered when estimating the propensity score: a model including only the true confounder(s) (M1), a model including all variables related to the outcome (M2), a model including all variables related to treatment selection (M3) and a model including all variables related to either treatment selection or outcome (M4). In terms of bias, the differences between the four models were small for Caliper matching and IPTW. M1 and M2 had slightly lower bias than M3 and M4. The differences between the models were larger for NNM, where M1 and M2 produced less biased estimates than did M3 and M4. There were, however, no large differences between M1 and M2. When comparing MSE, M2 was better than the other three models in all but a few cases across all simulations. There were no clear patterns in the relative performance of the four models across the three simulated scenarios, sample sizes or censoring distributions. In the remainder of this thesis, we will focus on the results produced by the model including all variables related to the outcome, M2.

Table 2 shows some descriptive statistics, averaged over the 5000 Monte Carlo replica-

tions, that describes the quality of the matching methods and the IPTW. Since censoring does

not have any impact on the estimation of the propensity scores, only the results for the data

simulated in the case of no censoring are presented. First, in Scenario A with n = 1000,

95.7 % of the the treated subjects were successfully matched to a control subject with similar

propensity score when calipers were applied. As the sample size decreases, so does the share

of matched subjects. Thus, there is a relatively low risk of bias due to incomplete matching

when n = 1000, but the risk increases as the sample size decreases. In Scenario B, where a

second confounder is added, a lower share of the subjects are matched compared to Scenario

A with only one confounder. In Scenario C, 10 % of the subjects received the treatment. The

relatively larger number of control subjects makes it easier to find a match within the specified

calipers. Compared to Scenario A, only a small number of treated subjects are excluded from

the analysis. In NNM, all treated subjects are included in the analysis. Without any restriction

on the quality of the matches, there are no guarantees that the distribution of the covariates

will be similar in the matched sample. The second panel in Table 2 shows the difference in

means between the treated group and the control group in the matched sample, averaged over

the Monte Carlo replications. The variable that is related to treatment only is not included in

the propensity score model and the imbalance between the groups remains after matching. The

12

Table 2: Share of treated subjects that were included in the matched sample, differences in means between treated and control in the matched sample and variance of IPT weights, averaged over 5000 Monte Carlo replications

Share matched (%), Caliper

Scenario                  n = 1000    500    200    140    100     80     60
A                             95.7   94.5   91.7   89.7   87.3   85.3   81.9
B                             90.2   89.1   86.0   84.2   81.3   79.0   75.9
Scenario                  n = 3000   1500    600    420    300    240    180
C                             99.8   99.4   98.7   98.3   97.8   97.1   96.2

Difference in means, matched sample, NNM

Scenario: variable        n = 1000    500    200    140    100     80     60
A: X1 (treatment)             .628   .628   .628   .623   .625   .619   .623
A: X2 (confounder)            .045   .053   .070   .084   .096   .109   .128
A: X3 (outcome)               .000   .000  -.001  -.002   .001  -.002   .000
B: X1 (treatment)             .616   .616   .614   .612   .607   .603   .607
B: X2 (confounder)            .091   .096   .110   .118   .133   .146   .152
B: X3 (confounder)            .091   .096   .110   .118   .134   .147   .156
B: X4 (outcome)               .001   .000   .001  -.002   .002  -.004   .000
Scenario: variable        n = 3000   1500    600    420    300    240    180
C: X1 (treatment)             .658   .657   .658   .660   .659   .662   .650
C: X2 (confounder)            .002   .003   .007   .009   .011   .015   .019
C: X3 (outcome)               .001   .001   .001  -.004  -.002   .001  -.004

Variance of weights, IPTW

Scenario                  n = 1000    500    200    140    100     80     60
A                             .130   .133   .143   .152   .160   .168   .190
B                             .205   .210   .237   .246   .286   .321   .342
Scenario                  n = 3000   1500    600    420    300    240    180
C                             .077   .077   .078   .078   .078   .079   .080

Abbreviations: Nearest neighbour matching (NNM), NNM with calipers (Caliper), inverse probability of treatment weighting (IPTW)

variable that is related only to the outcome was balanced between the groups before matching.

The true confounder(s) have a substantial imbalance before matching, which is to a large

extent eliminated in the matched sample. The differences between the groups increase as the

sample size decreases. The difference between groups is also larger when a second

confounder is added in Scenario B, and smaller when a smaller share of the subjects is treated in

Scenario C. In summary, the risk of bias increases in smaller samples for both matching methods,

either because of larger residual imbalance in the confounder or because a larger share of the

target population is excluded from the analysis. For IPTW, the variance of the weights increases

as the sample size decreases. Also, compared to Scenario A, the variances are larger in B and smaller in C.
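The three diagnostics in Table 2 can be illustrated with a minimal sketch. It is not the simulation code of this thesis. The data-generating values, the use of the true rather than an estimated propensity score, the greedy matching order, the caliper width of 0.2 standard deviations of the logit, and the weights for the effect in the treated (1 for treated, ps/(1 - ps) for controls) are all assumptions made for illustration.

```python
import math
import random
import statistics


def simulate_subjects(n, seed, intercept=-1.0, slope=1.0):
    """Hypothetical data: one N(0, 1) confounder x and a logistic treatment model."""
    rng = random.Random(seed)
    subjects = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        logit = intercept + slope * x
        ps = 1.0 / (1.0 + math.exp(-logit))  # true propensity score (assumption)
        z = 1 if rng.random() < ps else 0
        subjects.append({"z": z, "x": x, "logit": logit, "ps": ps})
    return subjects


def greedy_match(subjects, caliper=None):
    """1:1 greedy nearest-neighbour matching on the logit, without replacement."""
    treated = [s for s in subjects if s["z"] == 1]
    controls = [s for s in subjects if s["z"] == 0]
    used = set()
    pairs = []
    for t in treated:
        best, best_dist = None, float("inf")
        for j, c in enumerate(controls):
            if j in used:
                continue
            dist = abs(t["logit"] - c["logit"])
            if dist < best_dist:
                best, best_dist = j, dist
        if best is None or (caliper is not None and best_dist > caliper):
            continue  # this treated subject stays unmatched and is excluded
        used.add(best)
        pairs.append((t, controls[best]))
    return pairs, len(treated)


def mean_difference(pairs):
    """Difference in confounder means, treated minus matched controls."""
    return statistics.fmean(t["x"] - c["x"] for t, c in pairs)


def att_weights(subjects):
    """Weights targeting the treated population: 1 for treated, ps/(1-ps) for controls."""
    return [1.0 if s["z"] == 1 else s["ps"] / (1.0 - s["ps"]) for s in subjects]


subjects = simulate_subjects(n=1000, seed=1)
caliper = 0.2 * statistics.pstdev(s["logit"] for s in subjects)

nnm_pairs, n_treated = greedy_match(subjects)
cal_pairs, _ = greedy_match(subjects, caliper=caliper)

share_nnm = len(nnm_pairs) / n_treated
share_cal = len(cal_pairs) / n_treated
diff_nnm = mean_difference(nnm_pairs)
diff_cal = mean_difference(cal_pairs)
weight_variance = statistics.pvariance(att_weights(subjects))

print(f"share matched: NNM {share_nnm:.3f}, caliper {share_cal:.3f}")
print(f"difference in means: NNM {diff_nnm:.3f}, caliper {diff_cal:.3f}")
print(f"variance of weights: {weight_variance:.3f}")
```

Rerunning the sketch with smaller n shows the trade-off described above: without calipers every treated subject is kept but residual imbalance can grow, whereas calipers bound the imbalance at the cost of excluding treated subjects.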


Figure 1: True marginal hazard ratio over time and averages

Figure 1 shows the true marginal hazard ratio in the treated population over time. It was

computed as described in Section 3.1.5 by simulating both potential outcomes, but the hazard

ratios were estimated for different time intervals. Included in the graphs are also estimates from

the correctly specified conditional model, as well as the crude marginal model and the marginal

model in matched samples, using caliper matching. The data were censored at a rate of 40 %,

with censoring times drawn from the Weibull distribution, except when computing the true marginal hazard ratio. The

dashed lines are the time average hazard ratios. The graphs clearly show that the marginal hazard ratio

is not constant. Over time, the hazards of the two groups converge. Thus, the time average

estimates will depend on follow-up time and the censoring distribution. On the other hand, the

correctly specified conditional model estimates constant hazard ratios.
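The convergence of the marginal hazards can be reproduced numerically. The sketch below is an illustration under assumed settings, not the computation behind Figure 1: it assumes a single standard normal covariate, an exponential baseline with unit hazard, and conditional log hazard ratios beta = gamma = log 2. Averaging the conditional survival and density over the covariate gives the marginal hazard in each arm, and their ratio is the marginal hazard ratio at time t.

```python
import math


def marginal_hazard_ratio(t, beta=math.log(2.0), gamma=math.log(2.0),
                          grid=2001, x_max=8.0):
    """Marginal HR at time t when h(t | z, x) = exp(beta*z + gamma*x), X ~ N(0, 1).

    The expectation over X is approximated by a sum on an equally spaced grid.
    All parameter values are illustrative assumptions.
    """
    step = 2.0 * x_max / (grid - 1)
    xs = [-x_max + i * step for i in range(grid)]
    dens = [math.exp(-0.5 * x * x) for x in xs]  # unnormalised N(0, 1) density

    def marginal_hazard(z):
        num = den = 0.0
        for x, d in zip(xs, dens):
            rate = math.exp(beta * z + gamma * x)
            surv = math.exp(-t * rate)   # conditional survival at time t
            num += d * rate * surv       # E[conditional density]
            den += d * surv              # E[conditional survival]
        return num / den

    return marginal_hazard(1) / marginal_hazard(0)


hr_start = marginal_hazard_ratio(0.0)
hr_mid = marginal_hazard_ratio(0.5)
hr_late = marginal_hazard_ratio(2.0)
print(f"marginal HR at t = 0, 0.5, 2: {hr_start:.3f}, {hr_mid:.3f}, {hr_late:.3f}")
```

At t = 0 the marginal ratio equals the conditional ratio exp(beta); at later times it moves towards 1, although the conditional ratio is constant by construction.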

Figure 2 shows the relative bias for the different methods in Scenario A, with censoring

times from the Weibull distribution. Compared to the crude estimates, all three methods provide

a substantial reduction in bias. Without censoring, the crude model yields estimates with a

bias of approximately 70 % when the sample size is 1000. Of the three methods, IPTW produces

the lowest bias. As the censoring rate increases, the bias increases. Censoring does

not seem to have any particular effect on the relative performance of the three methods. Since


Figure 2: Relative bias. Scenario A (one confounder, 30 % treated)

Figure 3: Mean squared error. Scenario A (one confounder, 30 % treated)


the marginal hazards in the groups are nonproportional, the time average estimates are expected

to be a function of the censoring distribution. If subjects with longer survival times are more

likely to be censored, the hazard ratios at later time points will get a smaller weight in the

average hazard ratio. Since the hazards are convergent, this will result in overestimation of

the average hazard ratio. Similar shifts in the bias were obtained when censoring times were

simulated from a uniform distribution, although the shifts were slightly smaller. The bias of the

adjusted conditional model is not affected by the censoring distribution. Note, however, that this

model estimates a different (conditional) effect and was compared to that effect.
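The dependence of the marginal Cox estimate on censoring can be illustrated with a small simulation. This is a sketch under assumptions that differ from the thesis design: treatment is randomised with probability 0.5, there is one standard normal covariate that is omitted from the fitted model, event times are exponential, censoring is administrative at a fixed time instead of Weibull or uniform, and the Cox partial likelihood for a single binary covariate is maximised with a hand-written Newton-Raphson routine (cox_log_hr_binary is a hypothetical helper, not a library function).

```python
import math
import random


def simulate_trial(n, beta, gamma, censor_time, seed):
    """Randomised treatment z, covariate x, exponential times with rate exp(beta*z + gamma*x)."""
    rng = random.Random(seed)
    times, events, z = [], [], []
    for _ in range(n):
        zi = 1 if rng.random() < 0.5 else 0
        xi = rng.gauss(0.0, 1.0)
        t = rng.expovariate(math.exp(beta * zi + gamma * xi))
        if censor_time is not None and t > censor_time:
            t, event = censor_time, 0  # administratively censored
        else:
            event = 1
        times.append(t)
        events.append(event)
        z.append(zi)
    return times, events, z


def cox_log_hr_binary(times, events, z, n_iter=25):
    """Newton-Raphson for the Cox partial likelihood with one binary covariate.

    Assumes no tied event times, which holds for continuous simulated times.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    beta = 0.0
    for _ in range(n_iter):
        n1 = sum(z)           # treated subjects at risk
        n0 = len(z) - n1      # control subjects at risk
        exp_beta = math.exp(beta)
        score = info = 0.0
        for i in order:
            if events[i]:
                p = n1 * exp_beta / (n0 + n1 * exp_beta)
                score += z[i] - p
                info += p * (1.0 - p)
            if z[i]:
                n1 -= 1
            else:
                n0 -= 1
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    return beta


beta = gamma = math.log(2.0)
full = simulate_trial(20000, beta, gamma, censor_time=None, seed=2019)
cens = simulate_trial(20000, beta, gamma, censor_time=0.5, seed=2019)

hr_full = math.exp(cox_log_hr_binary(*full))
hr_cens = math.exp(cox_log_hr_binary(*cens))
print(f"marginal HR, no censoring: {hr_full:.3f}")
print(f"marginal HR, censored at t = 0.5: {hr_cens:.3f}")
```

Both runs use the same simulated subjects. Removing late follow-up moves the estimate away from the full-follow-up value and towards the conditional ratio exp(beta) = 2, because the early hazard ratios, which are further from 1, receive all the weight.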

The mean squared errors for Scenario A are presented in Figure 3. Also in terms of MSE,

IPTW yields the best results. NNM and caliper matching have similar MSE. As the sample

size gets smaller, the MSE of caliper matching increases slightly more rapidly.
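For reference, the two performance measures can be written as short functions. This is a sketch of the usual definitions, applied to a made-up vector of estimates. Whether the thesis computes them on the hazard ratio or the log hazard ratio scale is not restated here, so the scale is left to the caller.

```python
import statistics


def relative_bias(estimates, truth):
    """Relative bias in percent: 100 * (mean estimate - truth) / truth."""
    return 100.0 * (statistics.fmean(estimates) - truth) / truth


def mean_squared_error(estimates, truth):
    """Average squared deviation of the estimates from the true value."""
    return statistics.fmean((e - truth) ** 2 for e in estimates)


# Made-up estimates from three hypothetical replications.
estimates = [1.0, 1.2, 1.4]
truth = 1.0

rb = relative_bias(estimates, truth)
mse = mean_squared_error(estimates, truth)

# MSE decomposes into squared bias plus the variance of the estimates.
bias_sq = (statistics.fmean(estimates) - truth) ** 2
variance = statistics.pvariance(estimates)
print(f"relative bias {rb:.1f} %, MSE {mse:.4f}, bias^2 + variance {bias_sq + variance:.4f}")
```

The decomposition explains why a method can have similar bias but clearly lower MSE than another: the difference then lies in the variance of the estimates.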

In Scenario B, where a second confounder is added, Figure 4 shows that the bias increases for all

models. Also in this scenario, IPTW has the best performance for all sample sizes. Compared

to Scenario A, the use of calipers when constructing the matched sample seems to have a larger

effect on reducing the bias. As in Scenario A, censoring of the data causes similar shifts in

the bias curves. The MSE, presented in Figure 5, shows that IPTW has the best performance

across all sample sizes. Caliper matching has lower MSE than NNM, except for the smallest

sample sizes.

Figure 6 shows the relative bias for Scenario C. Also in this scenario, IPTW has the lowest

bias. However, the differences between the three methods are small. When the sample size is

180, with 18 treated subjects on average, the bias of IPTW is 5.5 % in case of no censoring.

The MSEs are presented in Figure 7. NNM and caliper matching have similar values for all

sample sizes. IPTW has the smallest MSE, approximately half of that for NNM and caliper

matching, and the difference gets larger for the smallest sample sizes. Thus, the variance of

the IPTW estimates seems to be less sensitive to sample size. The bias and MSE are almost

identical for NNM and caliper matching.

More comprehensive results are available from the author on request.


Figure 4: Relative bias. Scenario B (two confounders, 30 % treated)

Figure 5: Mean squared error. Scenario B (two confounders, 30 % treated)


Figure 6: Relative bias. Scenario C (one confounder, 10 % treated)

Figure 7: Mean squared error. Scenario C (one confounder, 10 % treated)


5 Discussion

In this thesis we have evaluated the performance of inverse probability of treatment weight-

ing (IPTW) and nearest neighbour matching (NNM), with and without calipers, to estimate

marginal hazard ratios (MHR) of a treatment effect in the treated population. The methods

were evaluated based on bias and MSE in a series of simulations.

First, we examined different models to estimate the propensity score. It was confirmed

that for all three propensity score methods, the model that included all variables related to the

outcome, regardless of whether they were confounders, had the best performance. These results

are in agreement with other simulation studies [5, 6]. All further comparisons of the propensity

score methods were based on this model.

The results show that the propensity score methods can adjust for measured confounders

and produce approximately unbiased estimates of MHRs in the absence of censoring. IPTW

had better performance across sample sizes and the different data-generating processes com-

pared to the matching methods, both in terms of bias and MSE. When there was one confounder,

IPTW produced estimates with bias of less than 10 % for sample sizes down to 60 subjects.

Adding a second confounder to the data-generating process resulted in slightly higher biases

for all methods. When a smaller proportion of the subjects received treatment, the differences

in bias and MSE between the methods decreased. Thus, a relatively large number of control

subjects was more important when using the matching methods. As the sample size decreased,

the bias increased for all methods. For the matching methods, it may be more

difficult to find a control subject with a similar propensity score in a smaller sample. A possible

explanation for the increase in bias for IPTW could be the larger variability of the weights in

the smaller samples. For the larger sample sizes, these results are in agreement with those of

Austin [9], who simulated data without censoring and with sample sizes of 10 000 subjects.

Censoring the data introduced bias that increased with the censoring rate. Chastang et al. [15]

have shown that omitting a covariate from the Cox PH model violates the PH assumption, even if

the covariate is completely balanced. When examining the proportionality of

the hazards, it was confirmed that the conditional hazards were proportional whereas the marginal

hazards were convergent. The increase in bias under censoring is likely a result of this

nonproportionality. When the hazards are nonproportional, the hazard ratio from the Cox PH regression

can be interpreted as a geometric average of the time-specific hazard ratios, weighted by the number

of observations at each time point. When the censoring rate was 50 %, the maximum increase in bias

was about 15 percentage points. Gayat et al. [8] simulated data with sample sizes of 1000 subjects and evaluated NNM.

With a 40 % censoring rate they found, unlike this study, that NNM gave approximately unbiased

estimates of the marginal hazard ratio. The difference between their results and those of the current study

may arise because they took the censoring distribution into account when computing

the true marginal hazard ratio, by simulating both potential outcomes and censoring the data.

Even if treatment assignment were completely randomised in an RCT, a marginal Cox model

would still yield estimates with similar bias if the survival data came from the same conditional

model with the same distribution of censoring times. Three reviews of randomised clinical trials

recently published in oncology journals show that treatment effects are often evaluated using

marginal Cox models [20–22]. Moreover, the PH assumption was justified or discussed in only a few of

the articles. As previous studies have noted, propensity score methods estimate marginal

rather than conditional effects [9]. Although not specifying the correct conditional model violates

the assumptions of the Cox regression, propensity score methods can be a useful alternative for

estimating effects similar to those that are commonly reported in RCTs.

Aalen et al. [23] discuss a problem of randomisation with the Cox model. Even if the treatment

assignment is randomised at the start of the study, this randomisation is lost by implicit

conditioning as soon as the first event occurs, if the outcome model is not correctly specified.

Thus, the marginal hazards in these simulations might lack a clear causal interpretation.
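The loss of balance among survivors can be made concrete with a small numerical sketch. It illustrates the selection mechanism only and is not taken from Aalen et al.: it assumes a randomised treatment, one standard normal covariate, exponential event times with rate exp(beta*z + gamma*x), and beta = gamma = log 2. The function computes the mean of the covariate among subjects still at risk at time t in each arm.

```python
import math


def mean_covariate_among_survivors(t, z, beta=math.log(2.0), gamma=math.log(2.0),
                                   grid=2001, x_max=8.0):
    """E[X | T > t, Z = z] for X ~ N(0, 1) and hazard exp(beta*z + gamma*x).

    The integrals over X are approximated by sums on an equally spaced grid.
    All parameter values are illustrative assumptions.
    """
    step = 2.0 * x_max / (grid - 1)
    num = den = 0.0
    for i in range(grid):
        x = -x_max + i * step
        # N(0, 1) density (unnormalised) times the probability of surviving past t
        w = math.exp(-0.5 * x * x) * math.exp(-t * math.exp(beta * z + gamma * x))
        num += x * w
        den += w
    return num / den


start_control = mean_covariate_among_survivors(0.0, z=0)
start_treated = mean_covariate_among_survivors(0.0, z=1)
later_control = mean_covariate_among_survivors(1.0, z=0)
later_treated = mean_covariate_among_survivors(1.0, z=1)

print(f"t = 0: control {start_control:.3f}, treated {start_treated:.3f}")
print(f"t = 1: control {later_control:.3f}, treated {later_treated:.3f}")
```

At t = 0 the covariate has the same mean in both arms, as randomisation guarantees. Among those still at risk at t = 1, the means differ between the arms, so a comparison of hazards at that time is no longer a comparison of exchangeable groups.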

There are certain limitations in this study. Since the models have been evaluated based on

simulations, it is difficult to know how generalisable the results are to situations other than the

specific settings in these simulations. For example, there could be different numbers of

covariates or distributions of the covariates, such as binary or correlated variables, different sizes

of treatment effects and other prevalences of treatment. Also, in reality it is unlikely that all

confounders are measured. However, these methods are difficult to evaluate analytically and

simulation studies are important in order to understand under which conditions the methods

work well. There are also several other matching algorithms, such as matching with replace-

ment, that perhaps could have given different results.

In summary, IPTW consistently had the lowest bias and MSE across the simulations. IPTW

and the matching methods had most similar performance when both sample size and the ratio

of the number of control to treated subjects were large. IPTW was less sensitive to sample size

and the number of control subjects.


References

1 Rubin D. Estimating causal effects of treatments in randomised and nonrandomised studies. Journal of Educational Psychology. 1974 mar;66(6):688–701.

2 Austin PC. The use of propensity score methods with survival or time-to-event outcomes: reporting measures of effect similar to those used in randomized experiments. Statistics in Medicine. 2014 mar;33(7):1242–1258. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4285179/.

3 Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55. Available from: https://academic.oup.com/biomet/article/70/1/41/240879.

4 Rubin DB, Thomas N. Matching Using Estimated Propensity Scores: Relating Theory to Practice. Biometrics. 1996 mar;52(1):249.

5 Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer T. Variable selection for propensity score models. American Journal of Epidemiology. 2006 jun;163(12):1149–1156. Available from: https://www.ncbi.nlm.nih.gov/pubmed/16624967.

6 Pirracchio R, Resche-Rigon M, Chevret S. Evaluation of the Propensity score methods for estimating marginal odds ratios in case of small sample size. BMC Medical Research Methodology. 2012 dec;12(1):70. Available from: https://www.ncbi.nlm.nih.gov/pubmed/22646911.

7 Rosenbaum PR. Model-Based Direct Adjustment. Journal of the American Statistical Association. 1987 jun;82(398):387. Available from: https://www.jstor.org/stable/2289440.

8 Gayat E, Resche-Rigon M, Mary JY, Porcher R. Propensity score applied to survival data analysis through proportional hazards models: a Monte Carlo study. Pharmaceutical Statistics. 2012 may;11(3):222–229. Available from: http://doi.wiley.com/10.1002/pst.537.

9 Austin PC. The performance of different propensity score methods for estimating marginal hazard ratios. Statistics in Medicine. 2013 jul;32(16):2837–2849. Available from: http://doi.wiley.com/10.1002/sim.5705.

10 Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharmaceutical Statistics. 2011 mar;10(2):150–161. Available from: http://doi.wiley.com/10.1002/pst.433.

11 Austin PC. A comparison of 12 algorithms for matching on the propensity score. Statistics in Medicine. 2014 mar;33(6):1057–1069. Available from: http://www.ncbi.nlm.nih.gov/pubmed/24123228 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC4285163.

12 Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society: Series B. 1972;34(2):187–220.

13 Greenland S, Robins JM, Pearl J. Confounding and Collapsibility in Causal Inference. Statistical Science. 1999;14(1):29–46. Available from: https://projecteuclid.org/euclid.ss/1009211805.

14 Martinussen T, Vansteelandt S. On collapsibility and confounding bias in Cox and Aalen regression models. Lifetime Data Analysis. 2013 jun;19(3):279–296.

15 Chastang C, Byar D, Piantadosi S. A quantitative study of the bias in estimating the treatment effect caused by omitting a balanced covariate in survival models. Statistics in Medicine. 1988 dec;7(12):1243–1255. Available from: http://doi.wiley.com/10.1002/sim.4780071205.

16 Boyd AP, Kittelson JM, Gillen DL. Estimation of treatment effect under non-proportional hazards and conditionally independent censoring. Statistics in Medicine. 2012 dec;31(28):3504–3515. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22763957 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC3876422.

17 Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine. 2005 jun;24(11):1713–1723. Available from: http://doi.wiley.com/10.1002/sim.2059.

18 Austin PC, Stuart EA. The performance of inverse probability of treatment weighting and full matching on the propensity score in the presence of model misspecification when estimating the effect of treatment on survival outcomes. Statistical Methods in Medical Research. 2017;26(4):1654–1670.

19 Austin PC, Stafford J. The Performance of Two Data-Generation Processes for Data with Specified Marginal Treatment Odds Ratios. Communications in Statistics - Simulation and Computation. 2008 may;37(6):1039–1051. Available from: http://www.tandfonline.com/doi/abs/10.1080/03610910801942430.

20 Rulli E, Ghilotti F, Biagioli E, Porcu L, Marabese M, D’Incalci M, et al. Assessment of proportional hazard assumption in aggregate data: a systematic review on statistical methodology in clinical trials using time-to-event endpoint. British Journal of Cancer. 2018 dec;119(12):1456–1463. Available from: http://www.nature.com/articles/s41416-018-0302-8.

21 Chai-Adisaksopha C, Iorio A, Hillis C, Lim W, Crowther M. A systematic review of using and reporting survival analyses in acute lymphoblastic leukemia literature. BMC Hematology. 2016 dec;16(1):17. Available from: http://bmchematol.biomedcentral.com/articles/10.1186/s12878-016-0055-7.

22 Batson S, Greenall G, Hudson P. Review of the Reporting of Survival Analyses within Randomised Controlled Trials and the Implications for Meta-Analysis. PLoS ONE. 2016;11(5):e0154870. Available from: http://www.ncbi.nlm.nih.gov/pubmed/27149107 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC4858202.

23 Aalen OO, Cook RJ, Røysland K. Does Cox analysis of a randomized survival study yield a causal treatment effect? Lifetime Data Analysis. 2015 oct;21(4):579–593. Available from: https://www.ncbi.nlm.nih.gov/pubmed/26100005.
