Learning Individual Treatment E ects from Networked ...Learning Individual Treatment E ects from...

Learning Individual Treatment Effects fromNetworked Observational Data

Ruocheng Guo1, Jundong Li1, and Huan Liu1

Arizona State University, Tempe, AZ 85287, USA{rguo12,jundongl,huan.liu}@asu.edu

Abstract. With convenient access to observational data, learning in-dividual causal effects from such data draws more attention in manyinfluential research areas such as economics, healthcare, and education.For example, we aim to study how a medicine (treatment) would affectthe health condition (outcome) of a certain patient. To validate causalinference from observational data, we need to control the influence ofconfounders - the variables which causally influence both the treatmentand the outcome. Along this line, existing work for learning individualtreatment effect overwhelmingly relies on the assumption that there areno hidden confounders. However, in real-world observational data, thisassumption is untenable and can be unrealistic. In fact, an importantfact ignored by them is that observational data can come with networkinformation that can be utilized to infer hidden confounders. For ex-ample, in an observational study of the individual treatment effect of amedicine, instead of randomized experiments, the medicine is assigned toindividuals based on a series of factors. Some factors (e.g., socioeconomicstatus) are hard to measure directly and therefore become hidden con-founders of observational datasets. Fortunately, the socioeconomic statusof an individual can be reflected by whom she is connected in social net-works. With this fact in mind, we aim to exploit the network structure torecognize patterns of hidden confounders in the task of learning individ-ual treatment effects from observational data. In this work, we proposea novel causal inference framework, the network deconfounder, whichlearns representations of confounders by unraveling patterns of hiddenconfounders from the network structure between instances of observa-tional data. Empirically, we perform extensive experiments to validatethe effectiveness of the network deconfounder on various datasets.

Keywords: Individual treatment effect · Confounding bias · Networkedobservational data.

1 Introduction

Recent years have witnessed the rocketing availability of observational data ina variety of highly influential research areas such as economics, healthcare, andeducation. Such data enables researchers to investigate the fundamental problemof learning individual-level causal effects of a certain interesting treatment (e.g.,

arX

iv:1

906.

0348

5v1

[cs

.SI]

8 J

un 2

019

2 R. Guo et al.

medicine) on an important outcome (e.g., health condition) without performingrandomized experiments which can be rather expensive, time consuming, andeven unethical [9,6]. For example, the easy access of a sea of electronic healthrecords ease the studies of individual treatment effect of a medicine on patients’health conditions.

Compared to data collected through randomized experiments, an observa-tional dataset is often effortless to obtain and often comes with a large numberof instances and an affluent set features. Meanwhile, the instances are often in-herently connected with rich auxiliary structure information such as the socialnetworks connecting multiple online users. Learning individual treatment effectsfrom observational data requires us to handle confounding bias. We say thereexists confounding bias when the correlation between the outcome and the treat-ment is distorted by the existence of confounders (a.k.a., the variables causallyinfluence both the treatment and the outcome). For example, the poor socioeco-nomic status of an individual can limit her access to an expensive medicine andhave negative impact on her health condition at the same time. Thus, withoutcontrolling the influence of the socioeconomic status, we may overestimate thetreatment effect of the expensive medicine. Controlling the confounding bias isknown as the main challenge of learning individual treatment effects from obser-vational data [18,6]. To deal with these confounders, a vast majority of existingindividual treatment estimation methods rely on the strong ignoralbility assump-tion [26,23,7,8,24] that all the confounders can be measured and are embeddedin the set of observed features. As such, these methods often exploit the availablefeatures to mitigate the confounding bias. In the running example, most of exist-ing efforts try to eliminate the influence of socioeconomic status on the chanceto take the medicine and the health condition through controlling the impactof the related proxy variables such as annual income, age, and education. How-ever, for observational data, given the fact that the causal relationships betweenvariables are unknown, the strong ignoralbility assumption becomes untenableand it is likely to be unrealistic due to the existence of hidden confounders [18].Recently, a series of methods are proposed to leverage representation learningto relax the strong ignorability assumption. Nonetheless, they still assumed thatwe were able to extract a set of latent features as the set of confounders fromobservational data using neural networks or factor models [14,27].

Despite the existing methods mentioned above, few have recognized the im-portance of the network structures connecting instances in the task of learningindividual treatment effects. In fact, topology of instances is ubiquitous in vari-ous observational data such as a social network of patients, an electrical grid ofpower stations, and a spatial network of geometric objects, to name a few. Inaddition, when it is notoriously hard to measure some confounders, alternatively,we can capture their influence by incorporating the underlying network struc-tures. Back to the running example, although the socioeconomic status of anindividual is often difficult to be quantified by observed instance features, it canbe implicitly represented by her social network patterns such as how many peo-ple are following her in the social network. Surprisingly, little attention has been

Learning Individual Treatment Effects from Networked Observational Data 3

paid to utilizing network structure patterns to mitigate the confounding bias andthen achieve precise estimation of individual treatment effects. To bridge the gap,in this work, we focus on leveraging network structural patterns along with ob-served features to minimize confounding bias in the task of individual treatmenteffect estimation. It worth noting that this work is significantly different fromthe existing studies on spillover effect, also known as network entanglement orinterference [25,19], where the treatment on an instance may causally influencethe outcomes of the connected units. In contrast, we focus on the situationswhere network structure can be exploited for controlling confounding bias. Forexample, a patient’s network patterns reflect her socioeconomic status but herhealth condition is not likely to be causally affected by what treatments areassigned to her neighbors.

To exploit network structure patterns for controlling the hidden confounders,we propose the network deconfounder, a novel framework that captures the influ-ence of hidden confounders. Fig. 1 illustrates the workflow of the proposed net-work deconfounder framework. In particular, the network deconfounder learnsrepresentations of confounders by mapping the original features as well as thenetwork structure into a shared latent feature space. Then the representationsof confounders can be exploited to control confounding bias and learn individualtreatment effects from observational data.

Here, we summarize the main contributions of this work as follows:

– We formulate the novel problem of learning individual treatment effects fromnetworked observational data.

– We propose a novel framework for learning individual treatment effects fromnetworked observational data – network deconfounder, which controls con-founding bias and estimates individual treatment effects given observationaldata with auxiliary network information.

– We perform extensive experiments to show that the proposed network de-confounder significantly outperforms the state-of-the-art methods for learn-ing individual treatment effects across two semi-synthetic datasets based onreal-world social network data.

2 Problem Statement

In this section, we start with an introduction of technical preliminaries andthen formally present the problem of learning individual treatment effects fromnetworked observational data.

First, we describe the notations used in this work. We denote a scalar, avector, and a matrix with a lowercase letter (e.g., t), a boldface lowercase letter(e.g., x), and a boldface uppercase letter (e.g., A), respectively. Subscripts signifyelement indexes (e.g., xi and Ai,j). Superscripts of the a potential outcomevariable denotes its corresponding treatment (e.g., yti). Table 1 shows a summaryof notations that are frequently referred to throughout this work.

4 R. Guo et al.

Original Features

NetworkStructure

Representationof

Confounders

Treatment

y1

y0

Inferred Potential

Outcomes

RepresentationBalancing Loss

Fig. 1. The work flow of the proposed network deconfounderTable 1. Notations

Notation Definition and Description

xi features of the i-th instanceti observed treatment of the i-th instanceA adjacency matrix of the networkhi representation of hidden confounders of instance i

yFi , yCFi observed outcome and counterfactual outcome of the i-th instance

yti potential outcome of the i-th instance with treatment t

n number of instancesm dimension of the feature spaced dimension of the representation space

Then we introduce networked observational data. In this work, we aim tolearn individual treatment effects from networked observational data. Such datacan be represented as ({xi, ti, yi}ni=1,A) where xi, ti and yi denote the fea-tures, the observed treatment, and the observed (factual) outcome of the i-thinstance, respectively. The symbol A signifies the adjacency matrix of the auxil-iary network information among different data instances. Here, we assume thatthe network is undirected and all the edges share the same weight1. Therefore,A ∈ {0, 1}n×n and Ai,j = Aj,i = 1 (Ai,j = Aj,i = 0) denotes that there isan (no) edge between the i-th instance and the j-th instance. We focus on thecases where the treatment variable takes binary values t ∈ {0, 1}. Without lossof generality, we use ti = 1 (ti = 0) to imply that the i-th instance is under treat-ment (control). We also let the outcome variable be a scalar and take continuousreal values as y ∈ R. Then we introduce the background knowledge of learn-ing individual treatment effects. To define individual treatment effect (ITE), westart with the definition of potential outcomes which is widely used in the causalinference literature [17,20]:

1 This work can be directly applied to weighted undirected networks. It can also beextended to directed networks using the Graph Convolutional Neural Networks fordirected networks [16]


Definition 1. Potential Outcomes. Given an instance i and the treatmentt, the potential outcome of i under treatment t, denoted by yti , is defined as thevalue of y would have taken if the treatment of instance i had been set to t.

Then we are able to provide mathematical formulation of ITE for the i-th in-stance in the setting of networked observational data as:

τi = τ(xi,A) = E[y1i |xi,A]− E[y0i |xi,A] (1)

Intuitively, ITE is defined as the expected outcome under treatment subtractedby the expected outcome under control, which reflects how much improvement ofthe outcome is caused by the treatment. It is worth noting that with the networkinformation, we are able to go beyond the limited information provided by thefeatures and distinguish two instances with the similar features but differentnetwork patterns in the task of learning individual treatment effects. With ITEdefined, we can formulate the average treatment effect (ATE) by taking theaverage of ITE over the instances as: ATE = 1n

∑ni=1 τi. Finally, we formally

present the definition of the problem of learning individual treatment effectsfrom networked observational data as follows:

Definition 2. Learning Individual Treatment Effects from NetworkedObservational Data. Given the networked observational data ({xi, ti, yi}ni=1,A),we aim to develop a causal inference model to learn the individual treatment ef-fects τi of each instance i.

3 The Proposed Framework

3.1 Background

It is not difficult to find that as only one of the two potential outcomes can beobserved, the main challenge of learning individual treatment effects is to inferthe counterfactual outcome yCFi = y

1−tii . In previous work [7,26,8,24], with the

strong ignorability assumption, controlling observed features is often consideredto be enough to eliminate confounding bias. Formally, strong ignorability can bedefined as:

Definition 3. Strong Ignorability. With strong ignorability, it is assumed:(1) The potential outcomes of an instance are independent of whether it receivestreatment or control given its features. (2) For each instance the probability toget treated is larger than 0 and less than 1. Formally, given the set of possiblevalues of features X , we can write the strong ignorability as:

y1, y0 ⊥⊥ t|x and 1 > Pr(t = 1|x) > 0,∀x ∈ X , t ∈ {0, 1}. (2)

It implies E[yt|x] = E[y|x, t] due to the independence between the treatment andthe potential outcomes, where y denotes the outcome resulting from the featuresx and the treatment t. Based on this observation, many existing methods boil

6 R. Guo et al.

t xy

h A

Fig. 2. The assumed causal graph: the network structure A along with the observedfeatures x are proxy variables of the hidden confounders h, which can be utilized tolearn representation of hidden confounders. The directed edges signify causal relation-ships, solid circles represent observed variables, and the dashed circle stand for hiddenconfounders.

down the task of learning ITE from observational data to learning a modelf : X × {0, 1} → R to approximate E[y|x, t].

However, in this work, we assume that there exist unobserved confounders.As a result, inferring counterfactual outcomes based on the features and thetreatment alone would result in biased estimator (E[y|x, t] 6= E[yt|x]) becausedependencies between the treatment variable and the two potential outcomevariables are introduced by hidden confounders.

3.2 Network Deconfounder

In this subsection, we propose the network deconfounder, a novel frameworkthat addresses the challenges of learning individual treatment effects from net-worked observational data. Given the adjacency matrix A, feature vector x, thetreatment t, and the outcome y, Fig. 2 shows the causal model assumed in theframework of network deconfounder. Instead of relying on the strong ignorabil-ity assumption, we assume that both the features and the network structure aretwo sets of proxy variables of the hidden confounders, which is a more practicalassumption than strong ignorability. For example, although we cannot directlymeasure socioeconomic status of an individual, we can collect features such asage, job type, zip code, and the social network to describe her socioeconomicstatus. Based on this assumption, network deconfounder attempts to learn repre-sentations to approximate hidden confounders and estimate ITE from networkedobservational data simultaneously.

Unlike eliminating the confounding bias based on the features, leveragingthe underlying network structure for controlling confounding bias raises specialchallenges: (1) instances are inherently interconnected with each other throughthe network structure and hence they are not independent identically distributed(i.i.d.), (2) the adjacency matrix of a network is often high-dimensional (A ∈{0, 1}n×n) and can be very sparse.

To tackle these special challenges of controlling confounding bias with net-work structure information, we propose the network deconfounder framework.The task can be divided into two steps. First, we aim to learn representationsof hidden confounders by mapping the features and the network structure intothe representation space. Then an output function is learned to infer potential


outcomes based on the treatment and the representation of hidden confounders.Then we present how the two tasks are accomplished by the network decon-founder.Learning Representation of Confounders. The first component of networkdeconfounder is a representation learning function g that can map the featuresand the underlying network into the latent space of hidden confounders, whichcan be formulated as g : X × A → Rd. We parameterize the g function usingGraph Convolutional Neural Networks (GCN) [4,11], whose effectiveness havebeen verified in various machine learning tasks across different types of networkeddata. To the best of our knowledge, this is the first work introducing GCN to thetask of learning individual treatment effects. In particular, the representation ofconfounders of the i-th instance is learned through GCN layers. Here, we describethe representation learning function g with a single GCN layer for the simplicityof notation. The representation learning function g is parameterized as:

hi = g(xi,A) = σ((ÂX)iU), (3)

where Â denotes the normalized adjacency matrix, (ÂX)i signifies the i-th row

of the matrix product ÂX, U ∈ Rm×d represents the weight matrix, and σstands for the ReLU activation function [5]. Specifically, with Ã = A + In and

D̃j,j =∑

j Ãj,j defined, the normalized adjacency matrix Â can be calculatedusing the renormalization trick [11]:

Â = D̃−12 ÃD̃−

12 (4)

We can compute Â in a pre-processing step to avoid repeating the computation.Then the weight matrix U ∈ Rm×d along with the ReLU activation functionmaps the input signal into the low-dimensional representation space. It is worthnoting that more than one GCN layers can be stacked to catch the non-linearitybetween hidden confounders and the input data.Inferring Potential Outcomes. Then we introduce the second component ofnetwork deconfounder, the output function f : Rd×{0, 1} → R, which maps therepresentation of hidden confounders as well as the treatment to the correspond-ing potential outcome. With hi ∈ Rd denoting the representation of the hiddenconfounders of the i-th instance and t ∈ {0, 1} signifying the treatment, to inferthe corresponding potential outcome, we parameterize the output function f as:

f(hi, t) =

{f1(hi) if t = 1

f0(hi) if t = 0, (5)

where f1 and f0 are the output functions for treatment t = 1 and t = 0. Specif-ically, we parameterize the output functions f1 and f0 using L fully connectedlayers and an output regression layer as:

f1 = w1σ(W1L...σ(W

11hi)), f0 = w

0σ(W0L...σ(W01hi)), (6)

where hi is the representation of hidden confounders (output of the g function)of the i-th instance, {Wtl}, l = 1, ..., L denote the weight matrices of the fully

8 R. Guo et al.

connected layers, and wt is the weight for the regression prediction layers. Thebias terms of the fully connected layers and the output regression layer aredropped for simplicity of notation. We can either set t = ti to infer the observedfactual outcome yCFi or t = 1− ti to estimate the counterfactual outcome.

With the two components of network deconfounder formulated, given thefeatures of the i-th instance xi, the treatment t, and the adjacency matrix A,we can infer the potential outcome as:

ŷti = f(g(xi,A), t), (7)

where ŷti denotes the inferred potential outcome of instance i corresponding totreatment t by the network deconfounder framework.Objective Function. Then we introduce three essential components of the lossfunction for the proposed network confounder.

Factual Outcome Inference. First, we aim to minimize the error of inferringthe observed factual outcomes. This leads to the first component of the lossfunction, the mean squared error on the inferred factual outcomes:

1

n

N∑i=1

(ŷtii − yi)2. (8)

Representation Balancing. Minimizing the error on the factual outcomes doesnot necessarily mean the error on counterfactual outcomes is also minimized. Inother words, in the problem of learning ITE from observational data, we es-sentially confront the challenge of domain adaptation [8,24]. In particular, thenetwork deconfounder would be trained on the conditional distribution of factualoutcomes Pr(yFi |xi,A, ti) but the task is to infer the conditional distribution ofcounterfactual outcomes Pr(yCFi |xi,A, 1 − ti). In [24, Lemma 1.], the authorshave shown that the inference error on the counterfactual outcomes is upper-bounded by a weighted sum of (1) the error on the factual outcomes; and (2)the integral probability metrics (IPM) measuring the difference between the dis-tributions the treated instances and the controlled instances in terms of theirconfounder representations. Therefore, we also aim to minimize the IPM mea-suring the divergence between the distributions of the two treatment groups re-garding their representations of hidden confounders. With P (h) = Pr(h|ti = 1)and Q(h) = Pr(h|ti = 0) being the empirical distributions of representation ofhidden confounders, we let ρZ(P,Q) denote the IPM defined in the functionalspace Z measuring the distance between the two distributions of representationsof hidden confounders. Assuming that Z denotes the set of 1-Lipschitz functions,the IPM reduces to the Wasserstein-1 distance which is defined as:

ρZ(P,Q) = infk∈K

∫h∈{hi}i:ti=1

||k(h)− h||P (h)dh (9)

where K = {k|k : Rd → Rd s.t. Q(k(h)) = P (h)} denotes the set of push-forward functions that can transform the representation distribution of the


treated P to that of the controlled Q. By minimizing αρZ(P,Q), we approx-imately minimize the gap between the distributions of representation of latentconfounders, where α ≥ 0 signifies the hyperparameter controlling the trade-off between representation balancing and the other terms. We use the efficientapproximation algorithm proposed by [3] to compute the Wasserstein distance(Eq. (9)) and its gradients for training the model.

`2 Regularization. Third, we let θ signify the vector of the model parame-ters of network deconfounder. Then a squared `2 norm regularization term onthe model parameters - λ||θ||22, is added to mitigate the overfitting problem,where λ ≥ 0 denotes the hyperparamter controlling the trade-off between the `2regularization and the other two terms.

Formally, We present the objective function of the network deconfounder as:

L({xi, ti, yi}ni=1,A) =1

n

N∑i=1

wi(ŷtii − yi)

2 + αρZ(P,Q) + λ||θ||22, (10)

4 Experiments

4.1 Dataset Description

It is notoriously hard to obtain ground truth for ITE because for a vast major-ity of cases, we can only observe one of the potential outcomes. For example, apatient can only choose to take the medicine or not to take it, but not both. Toresolve this problem, we follow the existing literature [8,24,14,22] to create semi-synthetic datasets. In particular, we introduce two new networked observationaldatasets which have ground truth features, network structures, synthetic treat-ments, and outcomes for the task of learning ITE from networked observationaldata in the presence of hidden confounders.BlogCatalog. BlogCatalog2 is an online community where users can post blogs.Each instance is a blogger. Each edge signifies the social relationship betweentwo bloggers. The features are the keywords in bloggers’ blog descriptions. Weextend the BlogCatalog dataset used in [12,13] by synthesizing (a) the outcome– the opinions of readers on each blogger; and (b) the treatment – whethercontent from a blogger is read more on mobile devices or on desktops. Similarto the News dataset used in previous work [8,23,22], we make the followingassumptions: (1) Readers either read on mobile devices or desktops. We say ablogger get treated (controlled) if her blogs are read more on mobile devices(desktops). (2) Readers prefer to read some topics from mobile devices, othersfrom desktops. (3) A blogger and her neighbors’ topics influence treatment. (4)A blogger and her neighbors’ topics causally affect readers’ opinions on them.Here, we aim to study the causal effect of being read more on mobile deviceson readers’ opinions of each blogger. To synthesize treatments and outcomes inaccordance to the assumptions mentioned above, we train a topic model on alarge set of documents. Then two centroids in the topic space are defined as: (i)

2 https://www.blogcatalog.com/

10 R. Guo et al.

we randomly sample a blogger and let the topic distribution of her descriptionbe the centroid of the treated instances, denoted by rc1. (ii) The centroid of thecontrolled, rc0, is selected to be the centroid of the topic distribution of all thebloggers’ description. Then we introduce how the treatments and outcomes aresynthesized based on the similarity between the topic distribution of a blogger’sdescription and the two centroids. With r(xi) denoting the topic distribution ofthe i-th blogger’s description, we model the preference of the readers of the i-thblogger’s content as:

Pr(t = 1|xi,A) =exp(pi1)

exp(pi1) + exp(pi0)

pi1 = κ1r(xi)T rc1 + κ2

∑j∈N (i)

r(xj)T rc1 = κ1r(xi)

T rc1 + κ2(Ar(xj))T rc1

pi0 = κ1r(xi)T rc0 + κ2

∑j∈N (i)

r(xj)T rc0 = κ1r(xi)

T rc0 + κ2(Ar(xj))T rc0

(11)

where κ1, κ2 ≥ 0 signifies the strength of the confounding bias resulting from ablogger’s topics and her neighbors’ topics. When κ1 = 0, κ2 = 0 the treatmentassignment is random and the greater the value κ is, the more significant the biasof device preference is. Then factual outcome of the i-th blogger is simulated as:

yF (xi) = C(pi0 + tip

i1) + �, (12)

where C is a scaling factor and the noise is sampled as � ∼ N (0, 1). In this work,we set C = 5, κ1 = 10, κ2 ∈ {0.5, 1, 2}. 50 LDA topics are learned from thetraining corpus. Then we reduce the vocabulary by taking the union of the mostprobable 100 words from each topic. By doing this, we end up with 2,173 bag-of-word features. We perform the above mentioned simulation 10 times for eachsetting of κ2. Figure 3 shows the distribution of topics in one of the simulationswhich is projected to two-dimensional space using TSNE [15]. We observe thatthere are more treated instances (red dots) near the centroid rc1 (green diamond)and more control instances (blue dots) close to the centroid rc0 (yellow diamond).In addition, a significant shift from the centroids can be perceived which showsthe impact of the network structure.

Flickr. Flickr3 is an image and video sharing service. Each instance refers toa user and each edge represents the social relationship between two users. Thefeatures of each user represent a list of tags of interest. We adopt the samesettings and assumptions as we do for the BlogCatalog dataset. Thus, we alsostudy the individual-level causal effects of being viewed on mobile devices onreaders’ opinions on the user. In particular, we also learn 50 topics from thetraining corpus using LDA and concatenate the top 25 words of each topic.Thus, we reduce the data dimension to 1,210. We maintain the same settings ofparameters as the BlogCatalog dataset (C = 5, κ1 = 10 and κ2 ∈ {0.5, 1, 2}).

3 https://www.flickr.com


75 50 25 0 25 50 7575

50

25

0

25

50

75

Fig. 3. Distribution of treated (red) and control (blue) instances in the topic space.The green and yellow diamonds signify the centroids rc1 and r

c0.

Table 2. Dataset Description

Dataset Name # Instances # Edges # Features κ2 Average ATE STD ATE

BlogCatalog 5,196 173,468 8,1890.5 4.366 0.5531 7.446 0.7592 13.534 2.309

Flickr 7,575 239,738 12,0470.5 6.672 3.0681 8.487 3.3722 20.546 5.718

In Table 2, we present a summary of the statistics of the semi-syntheticdatasets described in this subsection. The average and standard deviation ofATE are calculated over the 10 runs under each setting of parameters.

4.2 Experimental Settings

Following the original implementation of GCN [11]4, we train the model with allthe training instances along with the complete adjacency matrix. ADAM [10] isthe optimizer we use to minimize the objective function of the network decon-founder (Eq. (10)). We randomly sample 60% and 20% of the instances as thetraining set and validation set and let the remaining be the test set. We perform10 times of random sampling for each simulation of the datasets and report theaverage results. Grid search is applied to find the optimal combination of hyper-parameters for the network deconfounder. In particular, we search learning ratein {10−1, 10−2, 10−3, 10−4}, the number of output layers in {1, 2, 3}, dimension-ality of the outputs of the GCN layers and the number of hidden units of thefully connected layers in {50, 100, 200}, α and λ in {10−3, 10−4, 10−5, 10−6}. Forthe baseline methods, we adopt the default setting of hyperparameters5.

4 https://github.com/tkipf/gcn5 As a reproducible research paper, the code of this work can be find at

https://github.com/rguo12/network-deconfounder/

12 R. Guo et al.

Then, we describe the baselines methods which are the state-of-the-art meth-ods for learning ITE from observational data:Counterfactual Regression (CFR) [24]. CFR is based on the strong ignorabilityassumption. It learns representations of confounders by mapping the originalfeatures into a latent space. It minimizes the factual outcome inference errorand the representation balancing error. Following [24], two types of represen-tation balancing error are considered: Wasserstein-1 distance (CFR-Wass) andmaximum mean discrepancy (CFR-MMD).Treatment-agnostic Representation Networks (TARNet) [24]. TARnet is a vari-ant of CFR without the representation balancing penalty term.Causal Effect Variational Autoencoder (CEVAE) [14]. CEVAE is a deep latent-variable model developed for learning ITE. It learns representation of confoundersas Gaussian distributions through propagating information from original fea-tures, observed treatments, and factual outcomes.Causal Forest [26]. Causal Forest is an extension of Breiman’s random forest [1]for learning ITE. It works with the strong ignorability assumption.Bayesian Additive Regression Trees (BART) [7]. BART is a Bayesian regressiontree based ensemble model which is widely used in the literature of learning ITE.It is also based on the strong ignorability assumption.

Two widely used evaluation metrics, the Rooted Precision in Estimation ofHeterogeneous Effect (

√�PEHE) and Mean Absolute Error on ATE (�ATE), are

adopted by this work. Formally, they are defined as:

√�PEHE =

√1

n

∑i=1

(τ̂i − τi)2, �ATE = |1

n

∑i=1

(τ̂i)−1

n

∑i=1

(τi)|, (13)

where τ̂i = ŷ1i − ŷ0i and τi = y1i − y0i denote the inferred ITE and the ground

truth ITE for the i-th instance.

4.3 Results

Effectiveness. First, we compare the effectiveness of the proposed networkdeconfounder with the aforementioned state-of-the-art methods. Table 3 sum-marizes the empirical results evaluated on the BlogCatalog and Flickr datasetswith C = 5, κ1 = 10 and κ2 ∈ {0.5, 1, 2}. We summarize the observations fromthese experimental results as follows:

– The proposed network deconfounder consistently outperforms the state-of-the-art baseline methods on the semi-synthetic datasets with treatmentsand outcomes generated under various settings. We perform one-tailed T-test to verify the statistical significance and the results indicate that thenetwork deconfounder achieves significantly better estimations on individualtreatment effects with a significant level of 0.05.

– Benefiting from the capability to capture the patterns of hidden confoundersfrom the network structure, the network deconfounder suffers the least whenthe influence of hidden confounders grows (from κ2 = 0.5 to κ2 = 2) in termsof the increase in the error metrics

√�PEHE and �ATE .


Table 3. Experimental Results.

BlogCatalog

κ2 0.5 1 2√�PEHE �ATE

√�PEHE �ATE

√�PEHE �ATE

NetDeconf (ours) 4.532 0.979 4.597 0.984 9.532 2.130

CFR-Wass 10.904 4.257 11.644 5.107 34.848 13.053

CFR-MMD 11.536 4.127 12.332 5.345 34.654 13.785

TARNet 11.570 4.228 13.561 8.170 34.420 13.122

CEVAE 7.481 1.279 10.387 1.998 24.215 5.566

Causal Forest 7.456 1.261 7.805 1.763 19.271 4.050

BART 4.808 2.680 5.770 2.278 11.608 6.418

Flickr

κ2 0.5 1 2√�PEHE �ATE

√�PEHE �ATE

√�PEHE �ATE

NetDeconf (ours) 4.286 0.805 5.789 1.359 9.817 2.700

CFR-Wass 13.846 3.507 27.514 5.192 53.454 13.269

CFR-MMD 13.539 3.350 27.679 5.416 53.863 12.115

TARNet 14.329 3.389 28.466 5.978 55.066 13.105

CEVAE 12.099 1.732 22.496 4.415 42.985 5.393

Causal Forest 8.104 1.359 14.636 3.545 26.702 4.324

BART 4.907 2.323 9.517 6.548 13.155 9.643

Parameter Study. Then we investigate how the values of the two importanthyperparameters α and λ affect the performance of the network deconfounder.Here, we fix the learning rate to be 10−2, the number of epochs to be 200, thenumber of GCN layers and the number of output layers to be 2, the numberof hidden units and the dimensionality of the representations to be 100. Thenwe vary α and λ in the range of {1, 10−2, 10−4, 10−6}. The results are shown inFig. 4. Due to the space limit, we only report the results on the BlogCatalogdataset with κ2 = 1 in terms of both error metrics

√�PEHE and �ATE . Based on

the observations that the√�PEHE and �ATE do not change significantly when

α ≤ 10−6 and 10−6 ≤ λ ≤ 1, we can conclude that the performance of thenetwork deconfounder is not sensitive to both parameters α and λ. However,when α is too large (α > 0.01), the performance of the network deconfounderdegrades. This is because when α is too large, the objective function wouldemphasize the importance of balancing the representation of confounders of thetwo treatment groups too much and sacrifice the accuracy on inferring ITE.

5 Related Work

In this section, we present an introduction of two directions of related work:learning ITE from observational data and graph convolutional neural networks.Learning Individual Treatment Effects Recently, the fundamental problemof learning ITE from observational data has attracted considerable attention inmany high-impact research areas including statistics, machine learning, and data

14 R. Guo et al.

1e-060.0001

0.011

1e-060.0001

0.011

4.60

4.70

4.80

4.90

5.00

4.60

4.65

4.70

4.75

4.80

(a)√�PEHE of BlogCatalog (κ2 = 1)

1e-060.0001

0.011

1e-060.0001

0.011

1.001.021.041.061.081.10

1.00

1.02

(b) �ATE of BlogCatalog (κ2 = 1)

Fig. 4. Impact of α and λ on the performance of the proposed framework.

mining. Hill [7] proposed to apply BART to learn ITE because it requires littlehyperparameter tuning. Wager and Athey [26] proposed the Causal Forest, whichextended Breiman’s random forest [1] for learning ITE. In [8,24], two methodsare proposed to learn representations of confounders using neural networks. Thework also showed that balancing the distributions of confounders of the treatedand controlled instances in the representation space can improve the performancein the task of learning ITE. However, the methods mentioned above rely on thestrong ignorability assumption to handle the hidden confounders, which is oftenuntenbale and can be unrealistic in real-world observational datasets. Recently,Louizos et al. [14] proposed to consider observed features as proxy variables ofreal confounders and use a deep latent-variable model to learn representation ofconfounders through variational inference. However, none of the previous workutilized the network structure to capture patterns of hidden confounders.

Graph Convolutional Neural Networks. Previous work on Graph Convolu-tional Neural Networks (GCN) mainly focused on development of spatially local-ized6 and computationally efficient convolutional filters for network data. Brunaet al. [2] proposed to use the first-order graph Laplacian matrix as the filter in thespectrum domain. However, this filter has a large number of trainable parametersand its the spatial locality is not guaranteed. In [4], Defferrard et al. introduceda more efficient and properly localized filter for graph convolution. This filter isparameterized as l-th order polynomials of the graph Laplacian matrix to ensurethe locality. Then the polynomials are approximated by its Chebyshev expan-sion to reduce the computational cost. Then, Kipf and Welling [11] proposed therenormalization trick to further improve the computational efficiency of GCN.Recently, GCN has also been applied to solve a variety of problems on networkdata such as recommendation systems [28] and knowledge graph [21]. Differentfrom the existing work, this paper is the first work exploiting GCN for the causalinference problem - learning ITE from observational data.

6 Here, spatial locality refers to the constraint that information of a node only prop-agates to its l-hop neighbors.


6 Conclusion

New challenges are presented by the prevalence of networked observational datafor learning individual treatment effects. In this work, we study a novel prob-lem, learning individual treatment effects from networked observational data.As the underlying network structure could capture useful information of hiddenconfounders, we propose the network deconfounder framework, which leveragesthe network structural patterns along with original features for learning betterrepresentations of confounders. Empirically, we perform extensive experimentsacross multiple real-world datasets. Results show that the network deconfounderlearns better representation of confounders than the state-of-the-art methods.

Here, we also introduce two most interesting directions of future work. First,we are interested in leveraging other types of structure between instances forlearning ITE from observational data. For example, temporal dependencies canalso be utilized to capture patterns of hidden confounders. Second, in this work,focus on static network structure. But the real-world networks can evolve overtime. Hence, we would like to investigate how to exploit dynamics in evolvingnetworks for learning ITE.

References

1. Breiman, L.: Random forests. Machine learning 45(1), 5–32 (2001)2. Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally

connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013)3. Cuturi, M., Doucet, A.: Fast computation of wasserstein barycenters. In: Interna-

tional Conference on Machine Learning. pp. 685–693 (2014)4. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on

graphs with fast localized spectral filtering. In: Advances in neural informationprocessing systems. pp. 3844–3852 (2016)

5. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Pro-ceedings of the fourteenth international conference on artificial intelligence andstatistics. pp. 315–323 (2011)

6. Guo, R., Cheng, L., Li, J., Hahn, P.R., Liu, H.: A survey of learning causality withdata: Problems and methods. arXiv preprint arXiv:1809.09337 (2018)

7. Hill, J.L.: Bayesian nonparametric modeling for causal inference. Journal of Com-putational and Graphical Statistics 20(1), 217–240 (2011)

8. Johansson, F., Shalit, U., Sontag, D.: Learning representations for counterfactualinference. In: International conference on machine learning. pp. 3020–3029 (2016)

9. Kallus, N., Zhou, A.: Confounding-robust policy improvement. In: Advances inNeural Information Processing Systems. pp. 9289–9299 (2018)

10. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprintarXiv:1412.6980 (2014)

11. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutionalnetworks. arXiv preprint arXiv:1609.02907 (2016)

12. Li, J., Hu, X., Tang, J., Liu, H.: Unsupervised streaming feature selection in socialmedia. In: Proceedings of the 24th ACM International on Conference on Informa-tion and Knowledge Management. pp. 1041–1050. ACM (2015)

16 R. Guo et al.

13. Li, J., Hu, X., Wu, L., Liu, H.: Robust unsupervised feature selection on networkeddata. In: Proceedings of the 2016 SIAM International Conference on Data Mining.pp. 387–395. SIAM (2016)

14. Louizos, C., Shalit, U., Mooij, J.M., Sontag, D., Zemel, R., Welling, M.: Causal ef-fect inference with deep latent-variable models. In: Advances in Neural InformationProcessing Systems. pp. 6446–6456 (2017)

15. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of MachineLearning Research 9, 2579–2605 (2008)

16. Monti, F., Otness, K., Bronstein, M.M.: Motifnet: a motif-based graph convolu-tional network for directed graphs. In: 2018 IEEE Data Science Workshop (DSW).pp. 225–228. IEEE (2018)

17. Neyman, J.S.: On the application of probability theory to agricultural experiments.essay on principles. section 9.(tlanslated and edited by dm dabrowska and tp speed,statistical science (1990), 5, 465-480). Annals of Agricultural Sciences 10, 1–51(1923)

18. Pearl, J., et al.: Causal inference in statistics: An overview. Statistics surveys 3,96–146 (2009)

19. Rakesh, V., Guo, R., Moraffah, R., Agarwal, N., Liu, H.: Linked causal variationalautoencoder for inferring paired spillover effects. In: Proceedings of the 27th ACMInternational Conference on Information and Knowledge Management. pp. 1679–1682. ACM (2018)

20. Rubin, D.B.: Bayesian inference for causal effects: The role of randomization. TheAnnals of statistics pp. 34–58 (1978)

21. Schlichtkrull, M., Kipf, T.N., Bloem, P., Van Den Berg, R., Titov, I., Welling,M.: Modeling relational data with graph convolutional networks. In: EuropeanSemantic Web Conference. pp. 593–607. Springer (2018)

22. Schwab, P., Linhardt, L., Bauer, S., Buhmann, J.M., Karlen, W.: Learning coun-terfactual representations for estimating individual dose-response curves. arXivpreprint arXiv:1902.00981 (2019)

23. Schwab, P., Linhardt, L., Karlen, W.: Perfect match: A simple method for learningrepresentations for counterfactual inference with neural networks. arXiv preprintarXiv:1810.00656 (2018)

24. Shalit, U., Johansson, F.D., Sontag, D.: Estimating individual treatment effect:generalization bounds and algorithms. In: Proceedings of the 34th InternationalConference on Machine Learning-Volume 70. pp. 3076–3085. JMLR. org (2017)

25. Toulis, P., Volfovsky, A., Airoldi, E.M.: Propensity score methodology inthe presence of network entanglement between treatments. arXiv preprintarXiv:1801.07310 (2018)

26. Wager, S., Athey, S.: Estimation and inference of heterogeneous treatment effectsusing random forests. Journal of the American Statistical Association 113(523),1228–1242 (2018)

27. Wang, Y., Blei, D.M.: The blessings of multiple causes. arXiv preprintarXiv:1805.06826 (2018)

28. Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W.L., Leskovec, J.: Graphconvolutional neural networks for web-scale recommender systems. In: Proceedingsof the 24th ACM SIGKDD International Conference on Knowledge Discovery &Data Mining. pp. 974–983. ACM (2018)

Learning Individual Treatment Effects from Networked Observational Data

Learning Individual Treatment E ects from Networked ...Learning Individual Treatment E ects from...

Documents

Transcript of Learning Individual Treatment E ects from Networked ...Learning Individual Treatment E ects from...