
Noname manuscript No. (will be inserted by the editor)

Constraint-based Causal Discovery with Mixed Data

Michail Tsagris · Giorgos Borboudakis · Vincenzo Lagani · Ioannis Tsamardinos

Received: date / Accepted: date

Abstract We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on appropriate regression models, and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data, and show that the proposed approach outperforms alternatives in terms of learning accuracy.

Keywords Constraint-based learning · Bayesian networks · Maximal ancestral graphs · Mixed data · Conditional independence tests

Michail Tsagris
Department of Computer Science, University of Crete, Heraklion, Greece
Tel.: +30-2810-393577
E-mail: [email protected]

Giorgos Borboudakis
Department of Computer Science, University of Crete and Gnosis Data Analysis (Gnosis DA), Heraklion, Greece
Tel.: +30-2810-393577
E-mail: [email protected]

Vincenzo Lagani
Department of Computer Science, University of Crete and Gnosis Data Analysis (Gnosis DA), Heraklion, Greece
Tel.: +30-2810-393578
E-mail: [email protected]

Ioannis Tsamardinos
Department of Computer Science, University of Crete and Gnosis Data Analysis (Gnosis DA), Heraklion, Greece
Tel.: +30-2810-393575
E-mail: [email protected]

1 Introduction

Typically, datasets contain different variable types, such as continuous (e.g. temperature), nominal (e.g. sex), ordinal (e.g. movie ratings) or censored time-to-event (e.g. customer churn), to name a few. Furthermore, data may be measured over time (e.g. longitudinal data) or without considering time (e.g. cross-sectional data). Such heterogeneous data are not exceptions, but the norm in many domains (e.g. biomedicine, psychology and business). In such cases, it is important and necessary to apply causal discovery methods that are able to handle mixed data types.

Unfortunately, most current approaches do not handle heterogeneous variable types. Constraint-based methods, like the PC and FCI algorithms [37] for Bayesian network (BN) and maximal ancestral graph (MAG) learning respectively, are general methods that use conditional independence tests to learn the causal network. Thus, in principle, they can be applied to heterogeneous variable types, as long as an appropriate conditional independence test is employed. For continuous variables, typical choices are the partial correlation test [3] or kernel-based tests [46]. Categorical variables are usually handled with the X² test or the G-test [1]. Similarly, most score-based methods, such as the K2 [9] and GES [7] algorithms for Bayesian network learning, employ scores for categorical variables [9, 17] or for continuous variables only [15]. Although there exist both constraint-based [4, 27, 10] and score-based [16, 14, 30, 4] approaches for learning with mixed data, they are limited in the variable types they can handle, and are too computationally expensive or make unrealistic assumptions.

In this work, we propose a simple and general method to handle mixed variables. We show how to deal with mixtures of continuous, binary, nominal and ordinal


variables, although the same methodology can be used to derive tests for other data types, such as count data, proportions (percentages), positive and strictly positive data, censored data, as well as robust versions for heteroscedastic data; see the R package MXM [24] for a list of available tests. Those tests can be directly plugged into existing constraint-based learning algorithms, such as the PC and FCI algorithms. Naturally, the proposed method is not limited to BN and MAG learning algorithms, but can be used with any algorithm that employs conditional independence tests, such as algorithms for Markov network structure discovery [6] or for feature selection [38].

We employ likelihood-ratio tests based on regression models to devise conditional independence tests for mixed data. A likelihood-ratio test for conditional independence of variables X and Y given a (possibly empty) set of variables Z can be performed by fitting two regression models for X, one using Z and one using Y ∪ Z, and comparing their goodness-of-fit. Under the null hypothesis of conditional independence, both models should fit the data equally well, as the inclusion of Y does not provide any additional information for X once Z is accounted for. Alternatively, one can flip X and Y, and fit two regression models for Y instead. Unfortunately, those tests do not necessarily give the same results, especially in low-sample scenarios, and thus are not symmetric. Symmetry is an important property, as the test decisions should not depend on the variable order.

In simulated experiments, we demonstrate that in the sample limit and by using appropriate regression models, both tests return the same p-value and thus are asymptotically symmetric. To handle finite sample cases, we consider different approaches to obtain symmetry, such as performing both tests and combining them appropriately, or performing only one test in an order-invariant fashion using predefined rules (similar to [34]). Finally, we evaluate two proposed symmetric tests (one of each category) against an alternative conditional independence test for mixtures of ordinal and continuous variables [10] on simulated Bayesian networks, and show that the symmetric test based on performing two asymmetric likelihood-ratio tests, called MM, outperforms the rest.

2 Preliminaries

2.1 Bayesian networks and maximal ancestral graphs

A Bayesian network (BN) [31, 37] B = 〈G, P〉 consists of a directed acyclic graph G over vertices (variables) V and a joint probability distribution P. P is linked to G through the Markov condition, which states that each variable is conditionally independent of its non-descendants given its parents. The joint distribution P can then be written as

P(V1, . . . , Vp) = ∏_{i=1}^{p} P(Vi | Pa(Vi))

where p is the total number of variables in G and Pa(Vi) denotes the parent set of Vi in G. If all conditional independencies in P are entailed by the Markov condition, the Bayesian network is called faithful. Furthermore, Bayesian networks assume causal sufficiency, that is, that there are no latent confounders between variables in V. A causal Bayesian network is a Bayesian network where edges are interpreted causally. Specifically, an edge X → Y exists if X is a direct cause of Y in the context of the measured variables V. Typically, multiple BNs encode the same set of conditional independencies. Such BNs are called Markov equivalent, and the set of all Markov equivalent BNs forms a Markov equivalence class. This class can be represented by a completed partially directed acyclic graph (PDAG), which in addition to directed edges also contains undirected edges. Undirected edges may be oriented either way in some BN in the Markov equivalence class (although not all combinations are possible), while directed and missing edges are shared among all equivalent networks. Two classes of algorithms for Bayesian network learning are constraint-based and score-based methods. Constraint-based learning algorithms, such as the PC algorithm [37], employ conditional independence tests to discover the structure of the network, perform an orientation phase afterwards to orient (some of) the edges, and return a PDAG. Score-based methods [9, 17, 7] assign a score to the whole network based on how well it fits the data, and perform a search in the space of Bayesian networks or PDAGs to identify a high-scoring network.

Maximal ancestral graphs (MAGs) [33] are generalizations of Bayesian networks that admit the presence of latent confounders, and thus drop the causal sufficiency assumption. In addition to directed edges, they also contain bi-directed edges, which encode dependencies due to latent confounders. As for BNs, multiple Markov equivalent networks may exist, forming a Markov equivalence class which can be represented by a graph called a partial ancestral graph (PAG). The FCI algorithm [37, 45], an extension of the PC algorithm, outputs such a PAG.

2.2 Conditional independence tests

Let X and Y be two random variables, and Z be a (possibly empty) set of random variables. X and Y are conditionally independent given Z, if P(X, Y | Z) = P(X | Z) · P(Y | Z) holds for all values of X, Y and Z. Equivalently, conditional independence of X and Y given Z implies P(X | Y, Z) = P(X | Z) and P(Y | X, Z) = P(Y | Z). Such statements can be tested using conditional independence tests. Examples of commonly employed conditional independence tests are the partial correlation test [3] for continuous multivariate Gaussian variables, and the G-test and the (asymptotically equivalent) X² test [37, 1] for categorical variables. All aforementioned tests are either likelihood-ratio tests or approximations of them; see [8] for the relation of partial correlation tests and F-tests, and [1] for the connections of the G-test to log-linear models and likelihood-ratio tests.

Likelihood-ratio tests, or asymptotically equivalent approximations thereof such as score tests or Wald tests, can be used to compare the goodness-of-fit of nested statistical models. Examples of statistical models are linear regression, binary logistic regression, multinomial regression and ordinal regression. Two models are called nested if one model is a special case of the other. Let M0 (reduced model) be a model for X using Z and M1 (full model) be a model for X using Y ∪ Z. M0 is nested within M1, as M1 can be transformed into M0 by simply setting the coefficients of Y to zero. We proceed with a brief description of the likelihood-ratio test; implementation details are considered in Section 2.3. Let LL(M) be the log-likelihood of a model M, and let Par(M) be the number of parameters in M. The test statistic T of a nested likelihood-ratio test between M0 and M1 equals T = 2 · (LL(M1) − LL(M0)), and follows asymptotically a χ² distribution with Par(M1) − Par(M0) degrees of freedom [42]. It is important to note that this result assumes that the larger hypothesis M1 is correctly specified, that is, that its assumptions are met (such as its functional form) and that all necessary variables are included. In case of model misspecification, the likelihood-ratio test statistic follows a different distribution [13], and should be handled appropriately [41, 40]. This topic is out of the scope of the current paper, and will not be further considered hereafter.

Note that, if the models M0 and M1 fit the data equally well, and thus are equivalent, it implies that X and Y are conditionally independent given Z (assuming, again, correct model specification), as Y does not provide any additional information for X once Z is given. We will use this property to show how to implement conditional independence tests for mixed variable types in Section 4.
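As an illustration, a minimal R sketch of such a test is given below, here for a binary X; all names are ours, and the snippet assumes a non-empty conditioning set Z stored as a data frame (with an empty Z, the reduced model is simply x ~ 1).

lrt_ci_test <- function(x, y, z) {
  # reduced model M0: x ~ Z; full model M1: x ~ Y + Z
  m0 <- glm(x ~ ., data = data.frame(z), family = binomial)
  m1 <- glm(x ~ ., data = data.frame(y = y, z), family = binomial)
  stat <- 2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m0)))  # T = 2(LL1 - LL0)
  df <- attr(logLik(m1), "df") - attr(logLik(m0), "df")          # Par(M1) - Par(M0)
  pchisq(stat, df, lower.tail = FALSE)  # asymptotic chi-squared p-value
}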

2.3 Implementing likelihood-ratio tests with mixed data

Without loss of generality, we assume hereafter that Y is the outcome variable, and likelihood-ratio tests are performed using regressions on Y. In order to fit a regression model M for variable Y using mixed continuous, nominal and ordinal variables, the nominal and ordinal variables have to be transformed appropriately first. Let X be a categorical (nominal or ordinal) variable, taking dX distinct values. X can be used in M by transforming X into dX − 1 dummy binary variables (also called indicator variables). Note that dX − 1 variables are used instead of one for each value of X, as the excluded one can be determined given the others. The degrees of freedom of a variable X is denoted as Dof(X), and equals 1 for continuous variables and dX − 1 for categorical variables. Similarly, the degrees of freedom for a set of variables Z is defined as Dof(Z) = ∑_i Dof(Zi). We note that we only consider linear models with intercept terms but without interaction terms, but everything stated can be directly applied to models with interaction terms or non-linear terms.
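In R, for instance, factors are dummy-coded automatically by model.matrix, and the degrees of freedom above can be computed with a small helper like the following sketch (assuming categorical variables are stored as factors):

dof <- function(v) {
  # Dof(X): dX - 1 dummy variables for a factor with dX levels, 1 otherwise
  if (is.factor(v)) nlevels(v) - 1 else 1
}
dof_set <- function(Z) sum(sapply(Z, dof))  # Dof(Z) = sum_i Dof(Z_i)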

2.3.1 Linear regression

Linear regression models can be used if Y is continuous. The number of parameters of the reduced model M0 equals Par(M0) = Dof(Z) + 1, while for M1, Par(M1) = Dof(Z) + Dof(X) + 1. Typically, F-tests (which fall in the category of Wald tests) are used for linear regression. The F statistic is computed as

F = [(RSS0 − RSS1) · (n − Par(M1))] / [RSS1 · (Par(M1) − Par(M0))]

where RSS0 and RSS1 are the residual sums of squares of models M0 and M1 respectively, and n is the sample size. The F statistic, under the null hypothesis (the reduced model is the correct one), follows an F distribution with (Par(M1) − Par(M0), n − Par(M1)) degrees of freedom, which is asymptotically equivalent to a χ² distribution with v = Par(M1) − Par(M0) = Dof(X) degrees of freedom. Alternatively, if X is also continuous, only the full model is required and a t-test on the coefficient of X can be performed¹.

¹ In this case, the t-test (Wald test) is equivalent to the F-test (likelihood-ratio test).
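The sketch below computes this F statistic directly from the two fitted models and checks it against R's built-in anova(); the names are illustrative, with a continuous outcome y, a tested variable x, and a data frame z of conditioning variables.

m0 <- lm(y ~ ., data = data.frame(z))         # reduced model M0
m1 <- lm(y ~ ., data = data.frame(x = x, z))  # full model M1
rss0 <- sum(residuals(m0)^2); rss1 <- sum(residuals(m1)^2)
p0 <- length(coef(m0)); p1 <- length(coef(m1)); n <- length(y)
Fstat <- ((rss0 - rss1) * (n - p1)) / (rss1 * (p1 - p0))
pval <- pf(Fstat, p1 - p0, n - p1, lower.tail = FALSE)
# anova(m0, m1) reproduces the same F statistic and p-value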

2.3.2 Logistic regression

In case Y is nominal, a binary or multinomial logistic regression model can be used, while for ordinal Y ordinal logistic regression is more appropriate. Typically, ordinal logistic regression makes the proportional odds assumption (also known as the parallel regression assumption): all levels of the ordinal variable must have the same slope, and the estimated constants are non-decreasing. The proportional ordered logit model that Y has a value larger than j given a set of predictors X is

P(Y > j) = exp(aj + ∑_i βi Xi) / (1 + exp(aj + ∑_i βi Xi))

Notice that the values of βi are the same for each category of Y (i.e., the log-odds functions for each class of Y are parallel). In practice, the proportional odds assumption does not necessarily hold [2]. Because of that, we consider the generalized ordered logit model [43] hereafter, which does not make the proportional odds assumption. The generalized ordered logit model is

P(Y > j) = exp(aj + ∑_i βi,j Xi) / (1 + exp(aj + ∑_i βi,j Xi))

where βi,j is the coefficient of the i-th variable Xi for the j-th category of Y. [43] described a simple way to fit this model, by fitting a series of binary logistic regressions where the categories of Y are combined. If there are M = 4 categories, for example, then for j = 1 category 1 is contrasted with categories 2, 3, and 4; for j = 2 the contrast is between categories 1 and 2 versus 3 and 4; and for j = 3, it is categories 1, 2, and 3 versus category 4.
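A minimal R sketch of this fitting scheme follows; it illustrates the description above (and is not the exact implementation used in MXM), assuming y is an ordered factor and X a data frame of predictors.

fit_gologit <- function(y, X) {
  J <- nlevels(y)
  # one binary logistic regression per threshold j,
  # contrasting categories {1..j} against {j+1..J}
  lapply(1:(J - 1), function(j) {
    above <- as.numeric(as.integer(y) > j)
    glm(above ~ ., data = X, family = binomial)
  })
}
# each fit j carries its own slopes, i.e. the beta_{i,j} of the generalized model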

Finally, for both multinomial and ordinal regression (using the generalized ordered logit model), the number of parameters is Par(M0) = (dY − 1)(Dof(Z) + 1) and Par(M1) = (dY − 1)(Dof(Z) + Dof(X) + 1), and the likelihood-ratio test has (dY − 1) · Dof(X) degrees of freedom.
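As a concrete sketch, a multinomial likelihood-ratio test can be carried out in R with nnet::multinom (names illustrative; dof() is the helper from Section 2.3):

library(nnet)
m0 <- multinom(y ~ ., data = data.frame(z), trace = FALSE)         # reduced model
m1 <- multinom(y ~ ., data = data.frame(x = x, z), trace = FALSE)  # full model
stat <- 2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m0)))
df <- (nlevels(y) - 1) * dof(x)   # (dY - 1) * Dof(X) degrees of freedom
pval <- pchisq(stat, df, lower.tail = FALSE)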

2.3.3 Limitations

We note that we implicitly assume that the assumptions of the respective models hold. For instance, linear regression assumes (among others) independent and Gaussian residuals, homoscedasticity, and that the outcome is a linear function of the model variables. The latter also applies to logistic regression models, and specifically that the log-odds ratio is a linear function of the variables. If the model assumptions do not hold, the tests do not follow the same asymptotic distribution, and thus may lead to different results. However, we note that linear regression models are robust to deviations from the assumption of normality of residuals, and to a smaller degree to deviations from the homoscedasticity assumption [26]. The latter could also be handled by using tests based on robust regression.

Furthermore, we also note that even if the data come from a Bayesian network whose functional relations are linear models as the ones considered above, there are cases where tests fail to identify certain dependencies. Consider for example a simple network consisting of three variables, X, Y and Z, where Y is nominal with 3 levels, X and Z are continuous and Y is a parent of X and Z. Let Y be uniformly distributed, Yi denote the binary variable corresponding to the i-th dummy variable of Y, X = −Y1 + Y2 + 0.1·εX and Z = Y1 + Y2 + 0.1·εZ, where ε ∼ N(0, 1). Thus, the conditional distribution of X and Z given Y is Gaussian, although their marginal distribution is non-Gaussian. An example of the joint distribution of X and Z with 1000 random samples is shown in Figure 1. Notice that Y induces a non-linear relation between X and Z, even though all functions are linear. Therefore, any test based on linear regression models on X and Z (or equivalently Pearson correlation) will not identify the dependence between them, despite them being unconditionally dependent. One approach to this problem is to use kernel-based tests (or other, non-linear tests), which would be able to identify such a dependency asymptotically. We note that although indirect dependencies may be missed by the proposed tests, direct dependencies (edges) would still be identified. Thus, algorithms such as the conservative PC algorithm [32] that only rely on the adjacency faithfulness assumption (i.e., two adjacent variables are dependent given any set of variables) could be used in conjunction with those tests, and the results would be correct, although possibly less informative.

[Figure 1: scatter plot of the joint distribution of X and Z (BN: X ← Y → Z), with X ~ −Y1 + Y2 + 0.1·ε and Z ~ Y1 + Y2 + 0.1·ε; Corr: 0.008, p-value: 0.795]

Fig. 1 An example where the proposed tests fail to identify the unconditional dependency between X and Z is shown. The correlation between X and Z is 0.008 and the p-value of the test equals 0.795, suggesting independence.
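This construction is easy to reproduce; the following R sketch (our own, with an arbitrary seed) generates such data and shows that the linear test misses the marginal X-Z dependence, while the direct dependence on Y is still detected.

set.seed(42)
n <- 1000
y <- sample(1:3, n, replace = TRUE)                 # uniform nominal Y, 3 levels
y1 <- as.numeric(y == 1); y2 <- as.numeric(y == 2)  # dummy variables of Y
x <- -y1 + y2 + 0.1 * rnorm(n)
z <-  y1 + y2 + 0.1 * rnorm(n)
cor.test(x, z)             # correlation near 0: the X-Z dependence is missed
anova(lm(x ~ factor(y)))   # the direct dependence of X on Y is clearly detected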

3 Related Work

Mixed data have been considered in the context of Markov network learning; see [44] for a review of such methods. [17] were the first to propose a Bayesian method to score Bayesian networks with mixed categorical and Gaussian variables. The score is derived under the assumption that continuous variables with discrete parents follow a conditional Gaussian distribution, similar to the graphical models considered by [25]. An important drawback of this approach is that it does not allow discrete variables to have continuous parents, limiting its use in practice. A different approach is followed by [14, 30], who consider methods for discretizing continuous variables given a specific Bayesian network structure. Such techniques can then be used to search over both a Bayesian network structure and a discretization strategy. [28] propose a method for testing unconditional independence for continuous variables which is also directly applicable to ordinal and nominal variables. The method has also been extended to the conditional case, with a single variable in the conditioning set [27]. We are not aware of any extension to the general case that considers larger conditioning sets. [4] propose a kernel-based method for graphical model learning with mixed discrete and continuous variables, and show how both scores and conditional independence tests can be derived from it. Its main drawbacks are that (a) it has two hyper-parameters, which may be hard to tune, and (b) it is computationally demanding, having a time complexity of O(n³), where n is the sample size, although approximations can be used that scale linearly with sample size.

[10] suggested a copula-based method for performing conditional independence tests with mixed continuous and ordinal variables. The idea is to estimate the correlation matrix of all variables in a latent space (containing latent variables which are mapped to the observed variables), which can then be directly used to compute partial correlations and perform independence tests. To this end, they employ Hoff's Gibbs sampling scheme for generating covariance matrices using copulas [18]. The main disadvantage is that the correlation matrix is estimated using Gibbs sampling, and thus may be computationally demanding and hard to estimate accurately. [23] build upon the work of [12], and propose hybrid copula Bayesian networks, which can model both discrete and continuous variables, as well as a method for scoring such networks.

Recently, [34] proposed to use likelihood-ratio tests based on linear and logistic regression models to derive conditional independence tests for mixed continuous and nominal variables. They suggest using linear regression instead of logistic regression whenever applicable, as it is more accurate. This work is most closely related to our approach. The main differences are: (a) they only consider continuous and nominal variables, whereas our proposed approach is more general, and is able to deal with other variable types such as ordinal variables, and (b) they do not address the asymmetry between the two directional tests, while we propose and evaluate methods that handle it.

4 Symmetric Conditional Independence Tests for Mixed Data

We consider conditional independence tests based on nested likelihood-ratio tests, using linear, logistic, multinomial and ordinal regression to handle continuous, binary, nominal and ordinal variables respectively. For all cases, we only consider models with linear terms, without any interactions, although this is not a limitation of the proposed approach and additional terms can be included.

Let H0: P(X, Y | Z) = P(X | Z) · P(Y | Z) (X and Y are conditionally independent given Z) be the null hypothesis of conditional independence. Since we do not have a direct way to test this hypothesis, we consider the null hypotheses H1: P(X | Y, Z) = P(X | Z), and H2: P(Y | X, Z) = P(Y | Z). H1 can be tested using a nested likelihood-ratio test by regressing on X, while H2 can be tested by flipping X and Y and regressing on Y. For instance, if X is continuous and Y is nominal, one can either fit two linear regression models for X to test H1, one using Y ∪ Z (full model) and one using only Z (reduced model), and perform an F-test, or fit two multinomial logistic regression models for Y in a similar fashion to test H2 and perform a likelihood-ratio test. Ideally, both tests should give identical results, and thus be symmetric. There are special cases, such as when X and Y are continuous and linear regression models are used, where symmetry holds. Unfortunately, this does not necessarily hold in the general case. To the best of our knowledge, it is not known under which conditions such tests are symmetric. Empirical evidence (see Section 5) suggests that tests using the aforementioned models give the same results asymptotically (this was also mentioned in [34]). Therefore, given sufficiently many samples, any one of the two tests can be used. For small sample settings, however, the test results often differ, which motivated us to consider methods for deriving symmetric tests.
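For the continuous/nominal example above, the two directional p-values can be computed along the following lines (an R sketch with illustrative names, building on the tests of Section 2.3):

p1 <- anova(lm(x ~ ., data = data.frame(z)),                 # H1: F-test on X
            lm(x ~ ., data = data.frame(y = y, z)))[2, "Pr(>F)"]
m0 <- nnet::multinom(y ~ ., data = data.frame(z), trace = FALSE)
m1 <- nnet::multinom(y ~ ., data = data.frame(x = x, z), trace = FALSE)
p2 <- pchisq(2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m0))),  # H2: LRT on Y
             df = nlevels(y) - 1, lower.tail = FALSE)
# p1 and p2 agree asymptotically, but may differ noticeably in small samples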

4.1 Symmetric tests by combining dependent p-values

One approach is to perform both tests and to combine them appropriately. Let p1 and p2 be the p-values of the tests for H1 and H2 respectively. As both hypothesis tests essentially test the same hypothesis, one can expect the p-values to be positively dependent. We use a method presented in [5] for combining dependent p-values (which we call MM hereafter), an extension of a previous method [35]. The resulting p-value pmm is computed as

pmm = min{2 · min(p1, p2), max(p1, p2)}. (1)

This p-value can be used to assess whether at least one of the two asymmetric null hypotheses can be rejected. Moreover, it can be demonstrated that pmm is theoretically correct even in the presence of specific types of correlations among the two p-values, as in the case of one-sided p-values based on Gaussian test statistics that are positively correlated [5]; whether this also holds for combining p-values stemming from the tests considered here is not clear and needs further investigation, but it is nevertheless a useful heuristic. In addition to that, we considered two simple approaches, taking the minimum or the maximum of the two p-values:

pmin = min(p1, p2) (2)
pmax = max(p1, p2) (3)

The latter is identical to testing whether both hypotheses can be rejected, and is an instance of the method by [5] for combining dependent p-values. Although taking the minimum p-value should be avoided for independent p-values, as it does not account for multiple testing, it may be a reasonable choice if the p-values have a high positive correlation.
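All three combination rules are one-liners; the sketch below implements Eqs. (1)-(3) directly.

combine_pvalues <- function(p1, p2) {
  c(pmm  = min(2 * min(p1, p2), max(p1, p2)),  # Eq. (1)
    pmin = min(p1, p2),                        # Eq. (2)
    pmax = max(p1, p2))                        # Eq. (3)
}
combine_pvalues(0.03, 0.08)  # pmm = 0.06, pmin = 0.03, pmax = 0.08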

There has been theoretical work on deriving the true distribution of the sum or the ratio of the two test statistics, assuming their correlation is known [19, 20]. A general, permutation-based method for estimating the correlation between test statistics has been proposed by [11]. It is computationally expensive, as it requires fitting a large number of models, which is prohibitive for learning graphical models. In anecdotal experiments we found that this method and the ones considered above produce similar results, and thus it was not further considered.

4.2 A strategy for prioritizing asymmetric tests

A different approach for deriving symmetric tests is to use a strategy to prioritize tests and to only perform one of the two tests. This is especially attractive due to its lower computational cost, compared to the previously described approach. [34] compared tests based on linear regression and multinomial logistic regression, and found that linear regression is generally more accurate. This can be explained by the fact that the full linear regression model has fewer parameters to fit than the full multinomial regression model (unless the variable is binary), and thus can be estimated more accurately given the same amount of samples. Let X be a continuous variable and Y be a categorical (nominal or ordinal) variable taking dY values. The number of parameters required by a full linear regression model for X using Y and Z equals Dof(Z) + (dY − 1) + 1 (see Section 2.3). The logistic regression model for Y, on the other hand, requires (dY − 1) · (Dof(Z) + 1 + 1) parameters. Thus, unless Y is binary and dY = 2, the logistic regression model always contains more parameters. Everything stated above also holds for the case of unconstrained generalized ordinal regression models. Using this fact, and the observation made by [34], we propose to prioritize tests as follows.

Priority: Continuous > Nominal > Ordinal

In case of two nominal or ordinal variables, the variable with the fewer values is regressed on, while in case of ties an arbitrary variable is picked. Note that in the latter case the proposed strategy is not always symmetric; we plan to address this case in future work. Recall that if both X and Y are continuous, the tests are symmetric and thus either one of them can be used. In anecdotal experiments we observed that ordinal regression models, especially the ones considered here, are typically harder to fit than multinomial logistic models, which is the reason why we prioritize nominal over ordinal variables. Hereafter, we will refer to this approach as the Fast approach.
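The rule can be encoded in a few lines; the sketch below (our own helper, assuming nominal variables are stored as factors and ordinal ones as ordered factors) returns which of the two variables to regress on.

choose_outcome <- function(x, y) {
  # priority Continuous > Nominal > Ordinal: the variable whose type has
  # the lower rank below is used as the regression outcome
  rank_of <- function(v) if (!is.factor(v)) 1 else if (!is.ordered(v)) 2 else 3
  rx <- rank_of(x); ry <- rank_of(y)
  if (rx != ry) return(if (rx < ry) "x" else "y")
  if (is.factor(x) && nlevels(x) != nlevels(y))        # same categorical type:
    return(if (nlevels(x) < nlevels(y)) "x" else "y")  # fewer values wins
  "x"  # tie: pick arbitrarily
}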

Finally, we note that the problem of asymmetry has been addressed before in different contexts. The MMHC algorithm [39] for Bayesian network learning performs feature selection for each variable using the MMPC algorithm to identify a set of candidate parents and children (neighbors), which may result in cases where a variable X is a neighbor of another variable Y but the opposite does not hold. If this is the case, MMPC corrects the asymmetry by removing variable X from the set of neighbors of Y. Similarly, in the context of Markov networks, Meinshausen and Bühlmann [29] consider adding an edge between two variables if their neighbor sets contain each other (logical conjunction) or if at least one of the neighbor sets contains the other (logical disjunction), where neighbor sets are inferred independently for each node. The authors state that both approaches perform similarly and are asymptotically identical. Both methods use asymmetric tests to identify the neighbors of each node, and then perform a symmetry correction. This approach is similar, although not identical, to taking the minimum (logical disjunction) or maximum (logical conjunction) p-value. Both are valid strategies, and should perform similarly (at least for large sample sizes) to the proposed ones. However, the proposed strategies are more general, and thus applicable with any method that uses conditional independence tests. In Section 5 we also see that MM typically performs better than strategies based on taking the minimum or maximum p-value.

4.3 Limitations

For certain variable types, such as longitudinal and censored time-to-event data, it is not always possible to perform both tests. Unless both X and Y are of the same type (e.g. both longitudinal or both censored time-to-event), it is not clear how to regress on the non-time-related variable. For example, if X is a censored time-to-event variable (that is, a binary variable indicating whether an event occurred, as well as a continuous variable with the time of the event), and Y is a continuous variable, it is straightforward to use X as the outcome and perform a likelihood-ratio test using methods such as Cox regression, while the opposite direction is harder to handle. We plan to investigate such variable types in the future.

5 Simulation Studies

We conducted experiments on simulated data to investigate the properties of mixed tests based on regression models, and to evaluate the proposed symmetric tests. Afterwards, we compare the MM and Fast symmetric tests to a copula-based approach (called Copula hereafter) for mixed continuous and ordinal variables [10] in the context of Bayesian network learning. The methods were compared on synthetic Bayesian networks with continuous and ordinal variables.

5.1 Data generation

We proceed with a description of the data generation procedure used throughout the experiments. We will describe the general case for data generation given a Bayesian network structure G and the type of each variable. Let X be a variable in G and Pa(X) be the parents of X in G. For the moment, we will only consider continuous and nominal variables; ordinal variables will be treated separately afterwards. In all experiments, nominal and ordinal variables take up to 4 values.

In case Pa(X) is empty, X is sampled from the standard normal distribution if it is continuous, and is uniformly distributed in case it is binary/nominal. If Pa(X) is not empty, then X = f(Pa(X)) = f(b0 + ∑_i bi·Pai(X) + εX), which is a linear or generalized linear function depending on the type of X. Although not shown above, as before all nominal variables are transformed into dummy variables, and thus a coefficient is assigned to each dummy variable. The following procedure is used to generate data for X.

1. Generate samples for each variable in Pa(X) recursively, until samples for each variable are available

2. Sample the coefficients b of f(Pa(X)) uniformly at random from [−1, −0.1] ∪ [0.1, 1]

3. Generate εX ∼ N(0, 1)

4. Compute X using f(Pa(X))

In order to generate ordinal variables, we first generated a continuous variable as described above, and then discretized it into 2-4 categories appropriately (without damaging the ordered property). Each category contains at least 15% of the observations, while the remaining ones are randomly allocated to all categories. This is identical to having a latent continuous variable (the one generated), but observing a discretized proxy of it. Note that, as the discretization is random, any normality of the input continuous variable is not preserved. Finally, ordinal variables in the parent sets are not treated as nominal variables, but simply as continuous ones, and thus only one coefficient is used for them for the purpose of data generation.
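A sketch of this generator for a single variable, assuming its parents have already been sampled into a data frame, could look as follows in R (continuous case plus our reading of the ordinal discretization; all names are ours).

simulate_child <- function(parents, ordinal = FALSE) {
  D <- model.matrix(~ ., data = parents)  # intercept plus dummy-coded parents
  # coefficients drawn uniformly from [-1, -0.1] U [0.1, 1]
  b <- runif(ncol(D), 0.1, 1) * sample(c(-1, 1), ncol(D), replace = TRUE)
  x <- as.numeric(D %*% b) + rnorm(nrow(D))  # b0 + sum_i b_i Pa_i(X) + eps_X
  if (!ordinal) return(x)
  k <- sample(2:4, 1)                          # 2-4 ordered categories
  w <- runif(k)
  props <- 0.15 + (1 - 0.15 * k) * w / sum(w)  # each category gets >= 15%
  qs <- quantile(x, cumsum(props)[-k])
  ordered(cut(x, breaks = c(-Inf, qs, Inf)))   # discretized, order preserved
}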

5.2 Investigating the properties of mixed tests

We considered 5 combinations of variable types and corresponding regression models: (a) linear-binary (L-B), (b) linear-multinomial (L-M), (c) linear-ordinal (L-O), (d) binary-ordinal (B-O), and (e) multinomial-ordinal (M-O). For each case, we considered the following simple Bayesian network models: (a) X ⊥ Y (unconditional independence), (b) X → Y and X ← Y (unconditional dependence), (c) X → Z ← Y (conditional dependence of X and Y given Z), also known as a collider [37], and (d) X ← Z → Y (conditional independence of X and Y given Z). In all cases, Z is continuous. We used the previously described procedure to generate data for those networks. The sample size varied in (20, 50, 100, 200, 500, 1000), and each experiment was performed 1000 times, except for case (b), which was performed 1000 times for each direction.

Figure 2 shows the correlation and decision agreements (reject or not the null hypothesis) at the 5% significance level (similar results hold true for the 0.1%, 1% and 10% significance levels) between all 5 pairs of regression models. For the unconditional dependence case, in which both directional models were considered, we repeated the experiment twice and report averages over both cases. We did not consider the correlation of p-values in the dependent cases, as one is typically interested in having low enough p-values to reject the null hypothesis. Overall, the correlation between both tests is very high, and tends to one with increasing sample size. An exception is the multinomial-ordinal (M-O) conditional independence case, whose correlation is noticeably smaller than the rest. This can be explained by the fact that this test is the hardest one, as either test uses models with 15 parameters to be fit, requiring more samples. The proportion of decision agreements is very high for all pairs, reaching over 90% even with 200 samples. This is very encouraging, as this is the most important factor for causal discovery methods.

[Figure 2: four panels: (a) Unconditional Independence, (b) Conditional Independence, (c) Unconditional dependence, (d) Conditional dependence]

Fig. 2 The correlation of the two p-values and the proportion of decision agreements at the 5% significance level are shown for different pairs of regression models. The correlations of p-values for (un)conditional independence increase with sample size, reaching almost perfect positive correlation in most cases. In terms of decision agreements, an agreement of over 90% is reached in all cases even with 200 samples.

Figures 3 and 4 show the estimated type I error and power of all methods. In the unconditional cases, as well as in most conditional cases, all methods perform similarly. Whenever linear models are involved, the asymmetric linear test and the symmetric MM method outperform the rest, which can be seen mostly in the type I error of the conditional independence case. For the conditional case of binary-ordinal and multinomial-ordinal pairs, the MM method offers the best trade-off between type I error, as it is very close to 5%, and power, being only slightly worse than some competitors for small samples. Asymmetric tests based on ordinal regression break down in the conditional cases for small sample sizes, and symmetric methods like MM should be preferred.

5.3 Evaluation on Bayesian network learning

As shown in the previous section, the best performing method is the MM method, while the proposed asymmetric approach seems to be promising if continuous variables are involved. In this section, we use those methods for Bayesian network learning. We compare them to a recent method by [10], which is applicable to continuous, binary and ordinal variables. As the Bayesian network learning algorithm we used the order-independent PC algorithm [21], as implemented in the R package pcalg [22]. The significance level was set to 0.01 for all experiments. For the Copula method, [10] used 20 burn-in samples and 80 samples to estimate the correlation matrix using Gibbs sampling. We increased these numbers to further improve its accuracy. Specifically, we used 2p burn-in samples and 4p samples to estimate the correlation matrix, where p is the number of variables in the data.
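For reference, plugging a symmetric mixed test into pcalg follows the package's indepTest interface; the wrapper below is a sketch, where mm_test stands for any symmetric test of Section 4 (e.g. the MM combination) and dataset is a hypothetical mixed data frame.

library(pcalg)
# pcalg calls indepTest(x, y, S, suffStat), with x, y, S column indices,
# and expects a p-value in return
mixedCItest <- function(x, y, S, suffStat) {
  d <- suffStat$data
  mm_test(d[[x]], d[[y]], d[S])  # hypothetical symmetric mixed CI test
}
fit <- pc(suffStat = list(data = dataset), indepTest = mixedCItest,
          alpha = 0.01, labels = colnames(dataset),
          skel.method = "stable")  # order-independent variant of [21]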

We generated Bayesian networks with 50 and 100 variables, and with an average degree of 3 and 5. For each case, we generated 50 random Bayesian networks, and sampled 200, 500 and 1000 training instances. In total, this amounts to 600 datasets. Each variable has a 50% probability of being continuous or ordinal, and ordinal variables take 2-4 values with equal probability. The sampling of the network parameters and the data generation were performed as described above.

To evaluate the performance of the different methods, we computed the structural Hamming distance (SHD) [39], as well as the precision and recall of the network structure and orientations. Naturally, all metrics were computed on the estimated PDAG and the true PDAG. Precision and recall are proportions; hence, in order to compare their values we used the t-test applied on log(a/b), where a is the precision (recall) of the MM method, and b is the corresponding precision (recall) of the competing methods². As for the SHD, we took the differences between the MM method and the rest. Since the values of the SHD are discrete, we used the Skellam distribution [36] and tested (via a likelihood-ratio test) whether its two parameters are equal, implying that the compared values are equal. A t-test could be applied here as well, but in order to be more exact we used a test (or distribution) designed for discrete data.

² The reason for this is that, these measures being proportions, the absolute difference does not reflect the same information as the ratio, which is more meaningful. The difference, for example, between 0.2 and 0.1 is the same as that of 0.8 and 0.7, but their ratio is clearly not the same.

[Figure 3: columns Linear-Binary, Linear-Multinomial, Linear-Ordinal, Binary-Ordinal, Multinomial-Ordinal; rows Unconditional Independence and Conditional Independence]

Fig. 3 Estimated type I error on the (un)conditional independence cases for each pair of regression models, and three methods for combining dependent p-values. The solid horizontal line is at the 5% level and the two dashed lines at the 4% and 6% levels. Whenever linear regression models are involved, the MM method and the linear test perform similarly. For the conditional case of binary-ordinal and multinomial-ordinal pairs, the MM method outperforms all methods.

[Figure 4: columns Linear-Binary, Linear-Multinomial, Linear-Ordinal, Binary-Ordinal, Multinomial-Ordinal; rows Unconditional Dependence and Conditional Dependence]

Fig. 4 Estimated power on the (un)conditional dependence cases for each pair of regression models, and three methods for combining dependent p-values. In most cases, all methods perform very similarly. For the multinomial-ordinal case, ordinal regression breaks down for small samples, and MM is slightly behind the rest. This is expected, as the other methods also have a larger type I error.

The results are summarized in Tables 1, 2 and 3. Each table contains average values over 50 random Bayesian networks. Overall, the proposed MM approach statistically significantly outperforms the other methods across all computed performance metrics. The Copula method outperforms MM in terms of both precision metrics only in the 50-variable case with average degree 3 and 200 samples, while the Fast approach is always inferior to MM, and is often comparable to Copula. Furthermore, both MM and Fast improve across all metrics with increasing sample size. Copula however does not always do so, and its precision often declines with increasing sample size (for example, see the cases with 50-variable networks).

6 Conclusions

In this paper a general method for conditional independence testing on mixed data is proposed, such as mixtures of continuous, nominal and ordinal variables, using likelihood-ratio tests based on regression models. Likelihood-ratio tests are not necessarily symmetric, and different approaches to derive symmetric tests are considered. In simulated experiments, it is shown that the likelihood-ratio tests considered in this paper are asymptotically symmetric. Furthermore, the proposed symmetric MM test is shown to significantly outperform competing methods in Bayesian network learning tasks.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful and constructive comments. The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 617393.

References

1. Agresti A (2002) Categorical Data Analysis, 2nd edn. Wiley Series in Probability and Statistics, Wiley-Interscience

2. Agresti A (2010) Analysis of Ordinal Categorical Data. John Wiley & Sons

3. Baba K, Shibata R, Sibuya M (2004) Partial correlation and conditional correlation as measures of conditional independence. Australian & New Zealand Journal of Statistics 46(4):657–664

4. Bach FR, Jordan MI (2002) Learning graphical models with Mercer kernels. In: NIPS, vol 15, pp 1009–1016

5. Benjamini Y, Heller R (2008) Screening for partial conjunction hypotheses. Biometrics 64(4):1215–1222

6. Bromberg F, Margaritis D, Honavar V (2009) Efficient Markov network structure discovery using independence tests. Journal of Artificial Intelligence Research 35:449–484

7. Chickering DM (2002) Optimal structure identification with greedy search. Journal of Machine Learning Research 3(Nov):507–554

8. Christensen R (2011) Plane Answers to Complex Questions: The Theory of Linear Models. Springer Science & Business Media

9. Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9(4):309–347

10. Cui R, Groot P, Heskes T (2016) Copula PC algorithm for causal discovery from mixed data. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 377–392

11. Dai H, Leeder JS, Cui Y (2014) A modified generalized Fisher method for combining probabilities from dependent tests. Frontiers in Genetics 5(32)

12. Elidan G (2010) Copula Bayesian networks. In: Advances in Neural Information Processing Systems, pp 559–567

13. Foutz RV, Srivastava RC (1977) The performance of the likelihood ratio test when the model is incorrect. The Annals of Statistics 5(6):1183–1194

14. Friedman N, Goldszmidt M, et al (1996) Discretizing continuous attributes while learning Bayesian networks. In: ICML, pp 157–165

15. Geiger D, Heckerman D (1994) Learning Gaussian networks. In: Proceedings of the 10th International Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., pp 235–243

16. Heckerman D, Geiger D (1995) Learning Bayesian networks: a unification for discrete and Gaussian domains. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., pp 274–284

17. Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20(3):197–243

18. Hoff PD (2007) Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics pp 265–283

19. Joarder AH (2009) Moments of the product and ratio of two correlated chi-square variables. Statistical Papers 50(3):581–592

20. Joarder AH, Omar MH, Gupta AK (2013) The distribution of a linear combination of two correlated chi-square variables. Revista Colombiana de Estadística 36:211–221

21. Kalisch M, Bühlmann P (2007) Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research 8(Mar):613–636

22. Kalisch M, Mächler M, Colombo D, Maathuis MH, Bühlmann P, et al (2012) Causal inference using graphical models with the R package pcalg. Journal of Statistical Software 47(11):1–26


Table 1 Precision and recall for the skeleton estimation. An asterisk (*) indicates that the precision or recall of the Fast or Copula approach is statistically significantly lower than that of MM at the 1% significance level. The italic font indicates that the precision of the Copula approach is statistically significantly higher than that of MM at the 1% significance level.

Skeleton Precision

3 neighbors          50 variables                   100 variables
Method        n = 200  n = 500  n = 1000     n = 200  n = 500  n = 1000
MM            0.783    0.981    0.988        0.949    0.971    0.974
Fast          0.708*   0.971*   0.979*       0.936*   0.952*   0.951*
Copula        0.898    0.942*   0.975        0.884*   0.896    0.914*

5 neighbors          50 variables                   100 variables
Method        n = 200  n = 500  n = 1000     n = 200  n = 500  n = 1000
MM            0.989    0.992    0.993        0.988    0.992    0.989
Fast          0.986    0.990    0.992        0.984    0.985*   0.985*
Copula        0.980*   0.971*   0.951*       0.987    0.961*   0.950*

Skeleton Recall

3 neighbors          50 variables                   100 variables
Method        n = 200  n = 500  n = 1000     n = 200  n = 500  n = 1000
MM            0.172    0.704    0.808        0.536    0.707*   0.794
Fast          0.155*   0.639*   0.711*       0.507*   0.643*   0.684*
Copula        0.152*   0.675    0.796        0.402*   0.669*   0.793

5 neighbors          50 variables                   100 variables
Method        n = 200  n = 500  n = 1000     n = 200  n = 500  n = 1000
MM            0.445    0.617    0.717        0.460    0.624    0.725
Fast          0.436*   0.575*   0.649*       0.457    0.582*   0.660*
Copula        0.374*   0.600*   0.700        0.341*   0.595*   0.725

23. Karra K, Mili L (2016) Hybrid copula Bayesian networks. In: Proceedings of the Eighth International Conference on Probabilistic Graphical Models, pp 240–251

24. Lagani V, Athineou G, Farcomeni A, Tsagris M, Tsamardinos I (2016) Feature selection with the R package MXM: Discovering statistically-equivalent feature subsets. arXiv preprint arXiv:1611.03227

25. Lauritzen SL, Wermuth N (1989) Graphical models for associations between variables, some of which are qualitative and some quantitative. The Annals of Statistics pp 31–57

26. Lumley T, Diehr P, Emerson S, Chen L (2002) The importance of the normality assumption in large public health data sets. Annual Review of Public Health 23(1):151–169

27. Margaritis D (2005) Distribution-free learning of Bayesian network structure in continuous domains

28. Margaritis D, Thrun S (2001) A Bayesian multiresolution independence test for continuous variables. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., pp 346–353

29. Meinshausen N, Bühlmann P (2006) High-dimensional graphs and variable selection with the lasso. The Annals of Statistics pp 1436–1462

30. Monti S, Cooper GF (1998) A multivariate discretization method for learning Bayesian networks from mixed data. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., pp 404–413

31. Pearl J (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Los Altos

32. Ramsey J, Spirtes P, Zhang J (2006) Adjacency-faithfulness and conservative causal inference. In: Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp 401–408

33. Richardson T, Spirtes P (2002) Ancestral graph Markov models. Annals of Statistics pp 962–1030

34. Sedgewick AJ, Ramsey JD, Spirtes P, Glymour C, Benos PV (2017) Mixed graphical models for causal analysis of multi-modal variables. arXiv preprint arXiv:1704.02621

35. Simes RJ (1986) An improved Bonferroni procedure for multiple tests of significance. Biometrika pp 751–754

36. Skellam JG (1946) The frequency distribution of the difference between two Poisson variates belonging to different populations. Journal of the Royal Statistical Society Series A (General) 109(Pt 3):296

37. Spirtes P, Glymour CN, Scheines R (2000) Causation, Prediction, and Search. MIT Press

38. Tsamardinos I, Aliferis CF, Statnikov A (2003) Time and sample efficient discovery of Markov blankets and direct causal relations. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 673–678

39. Tsamardinos I, Brown LE, Aliferis CF (2006) The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning 65(1):31–78


Table 2 Precision and recall for the estimation of the orientations. An asterisk (*) indicates that the precision or recall of the Fast or Copula approach is statistically significantly lower than that of MM at the 1% significance level. The italic font indicates that the precision of the Copula approach is statistically significantly higher than that of MM at the 1% significance level.

Orientation Precision

3 neighbors          50 variables                   100 variables
Method        n = 200  n = 500  n = 1000     n = 200  n = 500  n = 1000
MM            0.686    0.979    0.988        0.943    0.965    0.974
Fast          0.608*   0.969*   0.978*       0.928*   0.942*   0.948*
Copula        0.812    0.940*   0.928*       0.976    0.932*   0.913*

5 neighbors          50 variables                   100 variables
Method        n = 200  n = 500  n = 1000     n = 200  n = 500  n = 1000
MM            0.987    0.992    0.993        0.986    0.992    0.989
Fast          0.984    0.989    0.992        0.982    0.985*   0.984*
Copula        0.975*   0.970*   0.950*       0.984    0.959*   0.949*

Orientation Recall

3 neighbors          50 variables                   100 variables
Method        n = 200  n = 500  n = 1000     n = 200  n = 500  n = 1000
MM            0.118    0.692    0.806        0.504    0.668    0.790
Fast          0.108*   0.621*   0.698*       0.476*   0.600*   0.669*
Copula        0.092*   0.666    0.793        0.342*   0.625*   0.791

5 neighbors          50 variables                   100 variables
Method        n = 200  n = 500  n = 1000     n = 200  n = 500  n = 1000
MM            0.413    0.606    0.711        0.430    0.613    0.719
Fast          0.406*   0.561*   0.638*       0.428    0.569*   0.649*
Copula        0.327*   0.591    0.696        0.289*   0.583*   0.722

Table 3 Structural Hamming distance (lower is better). An asterisk (*) indicates that the SHD of the Fast or Copula approach is statistically significantly higher than that of MM at the 1% significance level.

Structural Hamming Distance

3 neighbors          50 variables                   100 variables
Method        n = 200  n = 500  n = 1000     n = 200   n = 500  n = 1000
MM            71.48    34.40    25.64        97.62     69.94    55.12
Fast          73.18*   38.66*   33.76*       101.04*   80.46*   74.28*
Copula        70.96    37.12*   30.30*       115.66*   76.88*   62.00*

5 neighbors          50 variables                   100 variables
Method        n = 200  n = 500  n = 1000     n = 200   n = 500  n = 1000
MM            81.46    57.28    44.54        158.60    112.35   87.10
Fast          82.12    62.62*   53.56*       158.50    123.55*  105.30*
Copula        91.60*   60.42*   49.84*       191.40*   124.95*  93.15*

40. Vuong QH (1989) Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: Journal of the Econometric Society pp 307–333

41. White H (1982) Maximum likelihood estimation of misspecified models. Econometrica: Journal of the Econometric Society pp 1–25

42. Wilks SS (1938) The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics 9(1):60–62

43. Williams R (2006) Generalized ordered logit/partial proportional odds models for ordinal dependent variables. Stata Journal 6(1):58

44. Yang E, Baker Y, Ravikumar P, Allen G, Liu Z (2014) Mixed graphical models via exponential families. In: Artificial Intelligence and Statistics, pp 1042–1050

45. Zhang J (2008) On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence 172(16):1873–1896

46. Zhang K, Peters J, Janzing D, Schölkopf B (2012) Kernel-based conditional independence test and application in causal discovery. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pp 804–813