LEARNING HIGH-DIMENSIONAL GRAPHICAL MODELS FOR GENERAL TYPES OF RANDOM VARIABLES
By
SUWA XU
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2017
© 2017 Suwa Xu
I dedicate this to everyone that helped me
ACKNOWLEDGMENTS
Completing this dissertation would not have been possible without the support from the
people that have helped me remain focused, motivated and inspired throughout the years. I am
extremely fortunate to be surrounded by such amazing people.
First I would like to thank my advisor, Professor Faming Liang, for his support throughout
the duration of my time as a graduate student at UF. His wisdom and generosity will always
inspire me. Also, his cheerfulness and encouragement have been essential in giving me the
space that allowed me to discover myself in the field of biostatistics.
I owe thanks to all of my committee members, Professor Yang Yang, Professor Fei Zou
and Professor Samuel Wong for their useful and constructive comments and advice.
I would also like to thank my friends and fellow graduate students at UF for their
company. I cannot imagine what my life is going to be without them.
Last but not least I would like to thank my family, my Mom, Dad and grandparents.
Without their support, I would never have been here for my PhD study.
TABLE OF CONTENTS
ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 Graphical Models
      1.1.1 Introduction to Graphical Models
      1.1.2 Graph Notation and Terminology
      1.1.3 Conditional Independence
   1.2 Markov Network and Markov Properties
   1.3 Bayesian Network
      1.3.1 Introduction to Bayesian Network
      1.3.2 Constraint-based Approaches
      1.3.3 Score-based Approaches
      1.3.4 Hybrid Approaches
   1.4 ψ-learning Algorithm for Learning Gaussian Graphical Models

2 UNDIRECTED GRAPHICAL MODEL FOR COUNT DATA
   2.1 RNA-seq Data and Poisson Graphical Models
   2.2 Method
      2.2.1 Data-Continuized Transformation
      2.2.2 Data-Gaussianized Transformation
      2.2.3 Consistency
   2.3 Simulation Studies
   2.4 Real Data Examples
      2.4.1 Liver Cytochrome P450s Subnetwork
      2.4.2 Acute Myeloid Leukemia mRNA Sequencing Network
   2.5 Discussion

3 BAYESIAN NETWORKS FOR MIXED DATA
   3.1 Introduction
   3.2 A Brief Review of Bayesian Network Theory
   3.3 Learning High-Dimensional Bayesian Networks
      3.3.1 Learning the Moral Graph
      3.3.2 Identifying v-structures
      3.3.3 Identifying Derived Directions
      3.3.4 Consistency of the Proposed Method
   3.4 Simulation Studies
      3.4.1 Mixed Data for an Undirected Graph
      3.4.2 Mixed Data for a Directed Graph
   3.5 Real Data Analysis
      3.5.1 Lung Cancer Genetic Network
      3.5.2 Glioblastoma Genetic Network with Methylation Adjustment
   3.6 Discussion

4 CONCLUSIONS AND FUTURE RESEARCH

APPENDIX

A CONSISTENCY OF TRANSFORMATION-BASED METHOD
   A.1 Proof of Lemma 1
   A.2 Existing Theory of Adaptive MCMC
   A.3 Proof of Lemma 2

B CONSISTENCY OF PROPOSED THREE-STAGE METHOD
   B.1 Consistency of Moral Graph Learning
   B.2 Consistency of v-structure Identification

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES
1-1 Conditional independences represented by Markov networks.

2-1 The posterior mean and standard deviation of αi, βi and θij for one simulated variable, where a1 = a2 = a and b1^(0) = b2^(0) = b^(0).

3-1 Outcomes of binary decision.

3-2 Average areas under the precision-recall curves produced by the three-stage and pseudo-likelihood methods. The number in parentheses represents the standard deviation of the areas averaged over 10 datasets.

3-3 Average precision and recall of the directed graphs produced by the three-stage, PC, HC and MMHC algorithms. The number in parentheses represents the standard deviation of the values averaged over 10 datasets.
LIST OF FIGURES
1-1 An example of a graphical model. Each arrow indicates a dependency. In this example: I depends on J, J depends on I, J depends on K and K depends on I.

1-2 Illustrative plot for the calculation of ψ-partial correlation coefficients, where the solid and dotted edges indicate the direct and indirect associations, respectively. The left and right shaded ellipses cover, respectively, the reduced neighborhoods of node i and node j in the correlation graph.

2-1 Left: scatter plot of the continuized data versus raw counts for one variable. Right: QQ-plot of the Gaussianized data for one continuized variable.

2-2 Precision-recall curves produced by the proposed method (Cont+NPN+ψ-learning), log-transformation-based ψ-learning (Log+NPN+ψ-learning), log-transformation-based gLasso (Log+NPN+gLasso), log-transformation-based nodewise regression (Log+NPN+nodewise regression), LPGM, SPGM, and TPGM for the simulated data with (n, p) = (100, 200).

2-3 Precision-recall curves of each method for different types of structures with (n, p) = (100, 200). Upper left: hub; upper right: scale-free; lower left: small-world; lower right: random.

2-4 Precision-recall curves of each method for different types of structures with (n, p) = (500, 200). Upper left: hub; upper right: scale-free; lower left: small-world; lower right: random.

2-5 Left: P450 gene regulatory subnetwork produced from Yang et al. (2010), where the known regulators and P450 genes are shown as blue rectangles and red ovals, respectively. Right: the subnetwork produced by the proposed method.

2-6 GRN produced by the proposed method for the AML RNA-seq data with (n, p) = (179, 500).

2-7 Log-log plots of the degree distributions of the four networks generated by the proposed method (upper left), gLasso (upper right), nodewise regression (lower left), and LPGM (lower right).

3-1 BMP format drawing. Note: no filetype is designated by adding an extension.

3-2 A smaller version of the graph structure underlying the simulation study, where the circle nodes represent Gaussian variables, the square nodes represent Bernoulli variables, and the solid, dotted and dashed lines represent three different types of edges.

3-3 Precision-recall curves produced by the three-stage and pseudo-likelihood methods for two mixed datasets: the left generated under the setting (n, pc, pd) = (500, 100, 100) and the right under the setting (n, pc, pd) = (100, 100, 100).

3-4 The true directed network for a dataset with n = 3000 samples.

3-5 The estimated directed network for a dataset with n = 3000 samples.

3-6 The Bayesian network produced by the three-stage method with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC (Lung Squamous Cell Carcinoma) samples.

3-7 The Bayesian network produced by MMHC with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC samples.

3-8 The Bayesian network produced by HC with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC samples.

3-9 Directed Glioblastoma genetic network learned by the three-stage method with methylation effects having been adjusted.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

LEARNING HIGH-DIMENSIONAL GRAPHICAL MODELS FOR GENERAL TYPES OF RANDOM VARIABLES
By
Suwa Xu
August 2017
Chair: Faming Liang
Major: Biostatistics
Graphical models have recently become a popular tool to study conditional independence
relationships among a large number of variables. There are two branches of graphical models,
Bayesian networks and Markov networks, among which Bayesian networks are directed acyclic
graphs while Markov networks do not contain direction information. In this thesis we consider
learning associations for general types of random variables, e.g., count data and mixed
data. The existing method for count data is the Poisson graphical model; however, it is not
consistent and can only infer certain local structures of the network. Meanwhile, the network
structure is difficult to learn when the number of variables p is greater than the sample
size n. Moreover, in practice, the existing methods for learning Bayesian networks work
primarily with discrete data sets. The contributions of this thesis include a transformation-based
algorithm for constructing networks from count data and a three-stage method for
learning Bayesian networks from mixed data under the small-n-large-p scenario. The numerical
results indicate that the proposed methods significantly outperform the existing methods. The
proposed methods can be used to construct genetic networks from different types of genomic data,
such as microarray, RNA-seq and mutation data.
CHAPTER 1
INTRODUCTION
1.1 Graphical Models
1.1.1 Introduction to Graphical Models
Graphical models have recently become a popular tool to study association networks
for a large number of variables, where the variables can refer to genes, proteins, SNPs, or any
other subjects depending on the problem under study. Generally speaking, a graphical model
uses a graph-based representation as the foundation for encoding a complete distribution over
a multi-dimensional space; the graph is a compact, factorized representation of a set
of independences that hold in the specific distribution. One can think of graphical models
as a marriage between graph theory and probability theory. Figure 1-1 shows an example of
a graphical model.
A graph G consists of a set of vertices V and a set of edges E joining some pairs of the
vertices. In a graphical model, each vertex represents a random variable, and the graph gives a
visual representation of the joint distribution of the entire set of random variables. If the graph
has only undirected edges, it is an undirected graph, also known as a Markov random field or
Markov network. In these graphs, the absence of an edge between two vertices indicates that
the corresponding variables are conditionally independent given the other variables. If all edges
are directed, the graph is said to be directed. There is an active literature on directed graphical
models or Bayesian networks; these are graphical models in which the edges have directional
arrows (but no directed cycles).

Figure 1-1. An example of a graphical model. Each arrow indicates a dependency. In this example: I depends on J, J depends on I, J depends on K and K depends on I.

Directed graphical models represent probability distributions that can be factored into products of
conditional distributions, and have the potential for causal inference. Both families of graphical
models encompass the properties of factorization and independence, but they differ in the set
of independences they can encode and the factorization of the distribution that they induce.
1.1.2 Graph Notation and Terminology
A graph is called a complete graph if each pair of vertices is connected by an edge.
A subset is complete if it induces a complete subgraph. If there is an arrow from vertex u
pointing towards vertex v , u is said to be a parent of v and v a child of u. The set of parents
of v is denoted as Pa(v) and the set of children of u as Pa(u).
Two vertices u and v are called adjacent if there is an edge joining them; this is denoted
by u ∼ v . If there is no edge between u and v , i.e., u ∼ v , then u and v are said to be
non-adjacent. The set of neighbors of vertex u is denoted as ne(u). A path < v1, ..., vn >
from v1 to vn of an undirected graph, G = (V ,E), is blocked by a set S ⊆ V if v2, ..., vn−1∩
S = ∅. There is similar concept for paths of acyclic, directed graph, but the definition is
based on d-separation (will discuss later). A graph G = (V ,E) is connected if for any pair
u, v ⊂ V , there is a path u, ..., v in G. A connected graph G = (V ,E) is a tree if for any
pair u, v ⊂ V , there is a unique path < u, ..., v > in G.
A cycle is a path, < u, ..., v >, of length greater than two with the exception that u = v ;
a directed cycle is defined in the obvious way. A directed graph with no directed cycles is called
an acyclic, directed graph or simply a DAG.
1.1.3 Conditional Independence
If X , Y , Z are random variables with a joint distribution P, we say that X is conditionally
independent of Y given Z under P, and write X ⊥ Y |Z [P], if, for any measurable set A in the
sample space of X , there exists a version of the conditional probability P(A|Y ,Z) which is a
function of Z alone. Usually P will be fixed and omitted from the notation.
When X , Y and Z are discrete random variables the condition for X ⊥ Y |Z can be
written as
P(X = x ,Y = y |Z = z) = P(X = x |Z = z)P(Y = y |Z = z),
where the equation holds for all z with P(Z = z) > 0. When the three variables admit a joint
density with respect to a product measure µ, we have
X ⊥ Y |Z ⇔ fXY |Z(x , y |z) = fX |Z(x |z)fY |Z(y |z),
where this equation is to hold almost surely with respect to P.
The conditional relation X ⊥ Y |Z has the following properties, where f denotes an
arbitrary measurable function on the sample space of X :
(a) if X ⊥ Y |Z then Y ⊥ X |Z ;
(b) if X ⊥ Y |Z and U = f (X ), then U ⊥ Y |Z ;
(c) if X ⊥ Y |Z and U = f (X ), then X ⊥ Y |(Z ,U);
(d) if X ⊥ Y |Z and X ⊥W |(Y ,Z), then X ⊥ (W ,Y )|Z .
Another property of the conditional independence relation is,
(e) if X ⊥ Y |Z and X ⊥ Z |Y then X ⊥ (Y ,Z).
However, property (e) does not hold in general, but only under additional conditions,
for example when the joint density of all variables with respect to a product measure is positive
and continuous.
A semi-graphoid is an algebraic structure which satisfies (a)-(d) where X , Y , Z are
disjoint subsets of a finite set and U = f (X ) is replaced by U ⊂ X (Pearl, 2014). If property
(e) also holds for disjoint subsets, it is called a graphoid.
A very important example of a model for the conditional independence properties is that
of graph separation in an undirected graph. Let A, B and C be subsets of the vertex set V
of a finite undirected graph G = (V ,E). Define
A ⊥g B|C ⇔ C separates A from B in G.
The graph separation has the following properties:

(a*) if A ⊥g B|C then B ⊥g A|C ;

(b*) if A ⊥g B|C and U is a subset of A, then U ⊥g B|C ;

(c*) if A ⊥g B|C and U is a subset of B, then A ⊥g B|(C ∪ U);

(d*) if A ⊥g B|C and A ⊥g D|(B ∪ C), then A ⊥g (B ∪ D)|C .

Even the analogue of (e) holds when all the involved subsets are disjoint. Therefore graph
separation satisfies the graphoid axioms.
1.2 Markov Network and Markov Properties
A Markov network, also known as a Markov random field, is a model over an undirected graph.
Associated with an undirected graph G = (V ,E) and a collection of random variables
(Xi)i∈V, there are three different types of Markov properties.

A probability measure P on X is said to obey

(P) the pairwise Markov property, relative to G, if for any pair (Xi ,Xj) of non-adjacent vertices, Xi ⊥ Xj |XV\{i,j};

(L) the local Markov property, relative to G, if for any vertex i ∈ V, Xi ⊥ XV\(ne(i)∪{i})|Xne(i), where ne(i) is the set of neighbors of i ;

(G) the global Markov property, relative to G, if for any triplet (A,B,S) of disjoint subsets of V such that S separates A from B in G, XA ⊥ XB |XS .
The above three Markov properties are not equivalent: the global Markov property is
stronger than the local Markov property, which in turn is stronger than the pairwise one. If
it holds for all disjoint subsets A, B, C and D that A ⊥ B|(C ∪ D) and A ⊥ C |(B ∪ D)
imply A ⊥ (B ∪ C)|D, then the three Markov properties are all equivalent. This condition is an
analogue of (e) and holds, for example, if P has a positive and continuous density with respect
to a product measure µ; in that case, the sets of graphs with associated probability distributions
that satisfy the pairwise, local and global Markov properties are the same. For example, consider
the undirected graphical model (Markov network) given by the chain A − B − C − D. By the
pairwise Markov property, A ⊥ D|(B ∪ C)
since there is no edge between A and D. But B also separates A from C and D, and therefore
by the global Markov property we conclude that A ⊥ C |B and A ⊥ D|B. Similarly, we have
B ⊥ D|C .
The global Markov property allows us to decompose a graph into smaller pieces, which
makes the graph easier to manage. For this purpose, we separate the graph into cliques.
A clique is a complete subgraph. It is maximal if no other vertex can be
added to it while still yielding a clique. For example, the maximal cliques of A − B − C − D are
{A,B}, {B,C} and {C ,D}.
Given a set of random variables X = (Xi)i∈V, let P(X = x) be the probability of a
particular field configuration x of X ; that is, P(X = x) is the probability of finding that the
random variables X take on the particular value x . The joint density over a Markov graph G
can be represented as

P(X = x) = ∏c∈cl(G) ϕc(xc), (1–1)

where cl(G) is the set of maximal cliques and the positive functions ϕc(·) are called clique
potentials. This factorization implies a graph with the independence properties defined by the
cliques in the product. The result holds for any Markov network G with a positive distribution,
and is known as the Hammersley-Clifford theorem (Hammersley & Clifford, 1971).
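As a concrete illustration of the factorization (1–1), the following sketch (in Python; the potential values are arbitrary numbers chosen for illustration) builds the joint distribution of four binary variables on the chain A − B − C − D from its clique potentials and verifies the global Markov statement A ⊥ C |B by direct enumeration.

```python
import itertools
import numpy as np

# Binary variables on the chain A - B - C - D. Maximal cliques: {A,B}, {B,C},
# {C,D}. The potential values below are arbitrary positive numbers.
phi_AB = np.array([[2.0, 1.0], [1.0, 3.0]])
phi_BC = np.array([[1.0, 2.0], [4.0, 1.0]])
phi_CD = np.array([[3.0, 1.0], [1.0, 2.0]])

# Joint distribution via the clique factorization (1-1).
joint = np.zeros((2, 2, 2, 2))
for a, b, c, d in itertools.product([0, 1], repeat=4):
    joint[a, b, c, d] = phi_AB[a, b] * phi_BC[b, c] * phi_CD[c, d]
joint /= joint.sum()  # normalization absorbed into the potentials

# Global Markov property check: A and C are separated by B, so
# P(a, c | b) must factor as P(a | b) P(c | b) for every b.
p_abc = joint.sum(axis=3)  # marginalize out D
for b in (0, 1):
    p_ac_b = p_abc[:, b, :] / p_abc[:, b, :].sum()
    outer = np.outer(p_ac_b.sum(axis=1), p_ac_b.sum(axis=0))
    assert np.allclose(p_ac_b, outer)  # A is independent of C given B
```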
1.3 Bayesian Network
1.3.1 Introduction to Bayesian Network
A Bayesian network is a model over a directed acyclic graph. Since the directions of the
edges are included in the model, it can represent more types of conditional independences
than the Markov network. That is, Bayesian networks can provide more accurate descriptions
than Markov networks of the relationships among random variables. For example, for any
two variables in a set of random variables, there are four possible combinations of marginal
and conditional independence statements. Table 1-1 gives an example for each of the three
cases that are representable by Markov networks.
Table 1-1. Conditional independences represented by Markov networks.

Conditional independence   | Marginally independent (X ⊥ Y) | Marginally dependent (X ̸⊥ Y)
X ⊥ Y |Z                   | X   Y − Z                      | X − Z − Y
X ̸⊥ Y |Z                  | non-representable              | X − Y − Z
The fourth case, in which X and Y are marginally independent but dependent
conditioned on variable Z , is not representable by a Markov network. However, it can be easily
represented by a Bayesian network using a v -structure X → Z ← Y , which places two
convergent directions on the edges X − Z and Y − Z . In the Bayesian formula, this
situation can be described by

π(X ,Y |Z) = π(Z |X ,Y )π(X )π(Y )/π(Z) ̸= π(X |Z)π(Y |Z),

which quite often holds for real problems. In Bayesian networks, the direction of an edge
represents the "parent of" relationship. For this reason, Bayesian networks have often been
used in causal inference; see, e.g., Spirtes (2010).
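The "explaining away" behavior of a v-structure can be checked numerically. In the sketch below, X and Y are independent fair coins and Z is taken to be their logical OR, a mechanism chosen purely for illustration; X and Y are marginally independent but become dependent once Z is observed.

```python
import itertools
import numpy as np

# v-structure X -> Z <- Y: X, Y are independent Bernoulli(0.5) causes,
# and Z = X OR Y (a deterministic mechanism chosen purely for illustration).
joint = np.zeros((2, 2, 2))  # indexed [x, y, z]
for x, y in itertools.product([0, 1], repeat=2):
    joint[x, y, int(x or y)] = 0.25

# Marginally, X and Y are independent.
p_xy = joint.sum(axis=2)
assert np.allclose(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0)))

# Conditioned on Z = 1 they become dependent ("explaining away"):
# P(x, y | Z = 1) no longer factorizes into P(x | Z = 1) P(y | Z = 1).
p_xy_z1 = joint[:, :, 1] / joint[:, :, 1].sum()
outer = np.outer(p_xy_z1.sum(axis=1), p_xy_z1.sum(axis=0))
print(np.allclose(p_xy_z1, outer))  # False
```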
There are several equivalent definitions of a Bayesian network. A Bayesian network
B = (G,XV) is composed of a directed acyclic graph (DAG) G = (V ,E) and a p-dimensional
random vector X = (X1, ...,Xp) with a probability density that recursively factorizes according
to the DAG,

P(x) = ∏i q(xi |pa(i)),

where pa(i) denotes the parents of Xi , and q(xi |pa(i)) denotes the conditional distribution of Xi
given pa(i). Therefore, if we know both the conditional independence relations among the variables
in X and a set of local probability distributions associated with each variable, we can recover
the joint distribution of X .
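A minimal sketch of this recursive factorization for a hypothetical three-node chain X1 → X2 → X3, with made-up conditional probability tables, is given below; the probability of any configuration is simply the product of the local conditionals.

```python
import itertools

# Hypothetical CPTs for the chain X1 -> X2 -> X3; all values are made up.
q1 = {0: 0.6, 1: 0.4}                                # P(X1)
q2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}      # P(X2 | X1)
q3 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}      # P(X3 | X2)

def joint(x1, x2, x3):
    """Recursive factorization P(x) = q(x1) q(x2|x1) q(x3|x2)."""
    return q1[x1] * q2[x1][x2] * q3[x2][x3]

# The factorized joint is a proper distribution, and any configuration's
# probability is read off as a product of local conditionals.
assert abs(sum(joint(*x) for x in itertools.product([0, 1], repeat=3)) - 1.0) < 1e-12
print(joint(1, 0, 1))  # 0.4 * 0.2 * 0.1 = 0.008
```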
A Bayesian network encodes a set of independencies that exist in the domain. To
guarantee the existence of these independencies in the actual population distribution, one
assumption needs to be satisfied: there exists no common unobserved
variable in the domain that is a parent of two or more observed variables.
If P factorizes recursively according to G, then P factorizes according to the moral graph
Gm and obeys the global Markov property relative to Gm. Gm is constructed by marrying the
parents and deleting the directions. The converse is not true: many independence relations in
P are not captured by Gm. This is because P has extra properties not possessed by a general
distribution that factorizes according to the undirected graph Gm.

Since P factorizes according to Gm, P satisfies the global, local and pairwise Markov
properties with respect to Gm, and the conditional independence relations can be read off from
Gm. Considering the local Markov property, we have Xi ⊥ XV\(ne(i)∪{i})|Xne(i). Since the moral graph is
obtained by marrying parents, ne(i) can be written as pa(i) ∪ ch(i) ∪ {j : ch(j) ∩ ch(i) ̸= ∅},
where ch(i) denotes the children of i in G. This set is called the Markov blanket of i in the DAG G, denoted
by MB(i). Therefore the local Markov property can be written as Xi ⊥ XV\(MB(i)∪{i})|XMB(i), which is
a rewrite of the local Markov property with respect to the moral graph Gm.
For two vertices u, v of G, we say that u is an ancestor of v and v is a descendant of u
if there is a directed path from u to v . Let the set of ancestors of v be denoted by an(v) and the set of
descendants of v by de(v). A set A is called an ancestral set if pa(v) ⊆ A for all v ∈ A; the
smallest ancestral set containing a set A is denoted by An(A).
Similarly, there are directed global, local and pairwise Markov properties on directed graphs. The
directed global Markov property states that XI ⊥ XJ|XU holds whenever I and J are separated
by U in GmAn(I∪J∪U), the moral graph of the smallest ancestral set containing I ∪ J ∪ U. The
directed global Markov property in a directed acyclic graph is the analogue of the global Markov
property in the case of an undirected graph, in the sense that it gives the sharpest possible rule
for reading conditional independence relations off the directed graph.
We say that P satisfies the directed local Markov property if each variable is conditionally
independent of its non-descendants given its parent variables:

Xi ⊥ XV\de(i)|Xpa(i) for all i ∈ V ,
where de(i) is the set of descendants and V \ de(i) is the set of non-descendants of i . In
contrast to the undirected case, the directed local Markov property and the directed global
Markov property are equivalent.
Lastly, P obeys the directed pairwise Markov property if for any pair (i , j) of non-adjacent
vertices with j ∈ V \ de(i),

Xi ⊥ Xj |X(V\de(i))\{j} for all i ∈ V .
The directed local Markov property implies the directed pairwise Markov property. The reverse
is not true in general.
A Bayesian network can also be defined through the directed Markov property: B is a Bayesian
network with respect to G if it satisfies the directed Markov property. Other definitions of
Bayesian networks are based on the Markov blanket and d-separation. B is a Bayesian network
with respect to G if every node is conditionally independent of all other nodes in the network,
given its Markov blanket. This definition can be made more general by d-separation: B is a
Bayesian network with respect to G if, for every triplet of disjoint sets I, J,U ⊂ V, it holds
that XI ⊥ XJ|XU whenever U d-separates I from J. Lauritzen (1996) proves the equivalence
between d-separation and the directed global Markov property: let I, J and U be disjoint
subsets of a DAG G; then U d-separates I from J if and only if U separates I from J in GmAn(I∪J∪U).
The process of learning a Bayesian network involves structure learning and parameter
learning. The first step is to induce the structure of the model, that is, the DAG, while the
second step is to estimate the parameters of the model defined by the structure. In practice,
once the graph structure is selected, the parameter estimation problem reduces to a set
of lower dimensional problems. Estimating parameters can be done using standard techniques
such as maximum likelihood, Bayesian estimation or regularized maximum likelihood. For example,
in the Bayesian estimation method, a prior distribution is assumed over the parameters
of the local probability density functions before the data are used, and conjugacy of this prior
distribution is usually desirable.
Under the conditions listed below, the structure learning algorithms considered will
discover a DAG structure equivalent to the DAG structure of the probability distribution P
(Spirtes et al., 2000):

• The independence relationships have a perfect representation as a DAG. This is also known as the DAG faithfulness assumption.

• The database consists of a set of independent and identically distributed cases.

• The database of cases is infinitely large.

• No hidden (latent) variables are involved.

• The statistical tests have no error.
Two DAGs representing the same set of conditional independence relations are equivalent
in the sense that they capture the same set of probability distributions. That is, two models
M1 and M2 are statistically equivalent if and only if they contain the same set of variables, and
joint samples over them provide no statistical grounds for preferring one over the other.

Any two models M1 and M2 over the same set of variables, whose graphs G1 and G2,
respectively, have the same skeleton and the same v -structures, are equivalent. That is, two
DAGs G1 and G2 are equivalent if they have the same skeleton and the same set of uncovered
colliders (i.e., Xi → Xj ← Xk structures where Xi and Xk are not connected by a link, also
known as v -structures). For instance, the models Xi → Xj → Xk , Xi ← Xj ← Xk and
Xi ← Xj → Xk are equivalent since they have the same skeleton and have no v -structures.
Based on the data alone, we cannot distinguish among them. These models can, however, be
distinguished from Xi → Xj ← Xk .
It is important to note that although Bayesian networks are often used to represent causal
relationships, this need not be the case, because of the existence of equivalence classes.
A causal network is a Bayesian network with an explicit requirement that the relationships
be causal. The additional semantics of the causal networks specify that if a node X is actively
caused to be in a given state x (an action written as do(X = x)), then the probability density
function changes to the one of the network obtained by cutting the links from the parents of X
to X, and setting X to the caused value x. Using these semantics, one can predict the impact
of external interventions from data obtained prior to intervention.
Building the structure of a Bayesian network is a difficult task subject to the
conditions listed above, and provably correct algorithms have an exponential worst-case complexity.
Identifying the exact Bayesian network structure is in general impossible due to the existence
of equivalence classes. There exist some methods dealing with Markov networks which have much
better complexity. One possible solution is to extend some existing methods
developed for Markov networks to learn Bayesian networks. However, none of the existing
methods developed for high-dimensional GGMs, e.g., graphical Lasso, nodewise regression and
ψ-learning, can be trivially extended to Bayesian networks due to fundamental differences in
their structures. In particular, the v -structures need to be handled specially when extending a
Markov network learning algorithm to Bayesian networks.
The existing Bayesian network learning methods can be grouped into three categories:
constraint-based, score-based and hybrid.
1.3.2 Constraint-based Approaches
In constraint-based approaches, constraints typically refer to conditional independence
statements, although non-independence-based constraints may be entailed by the structure in
certain cases where latent variables exist. The conditional independence statements can usually be
read off the graph using the d-separation criterion. Structure learning in this case is then the task of
identifying a DAG structure that best encodes a set of conditional independence relations. The
set of conditional independence relations may, for example, be derived from the data source by
statistical tests. However, based on the data set alone, we can at most hope to identify an
equivalence class of graphs encoding the conditional independence relations of the generating
distribution.

A constraint-based algorithm is based on independence tests I (X ,Y |SXY ), which
indicate whether X is conditionally independent of Y given the subset SXY . In this case, any information
source able to provide such information works.
The most straightforward algorithm proposed is the inductive causation (IC)
algorithm (Pearl & Verma, 1995).

IC Algorithm

1. For each pair of vertices Xi and Xj , search for a set Sij such that Xi and Xj are conditionally independent given Sij . If there is no such Sij , place an undirected edge between these two vertices.

2. For each pair of non-adjacent vertices Xi and Xj with a common neighbor Xk , check if Xk ∈ Sij . If it is, then continue. If it is not, then add arrowheads pointing at Xk (i.e., Xi → Xk ← Xj).

3. Orient as many of the undirected edges as possible subject to two conditions: (i) the orientation should not create a new v-structure; and (ii) the orientation should not create a directed cycle.
However, this algorithm requires a number of conditional independence tests that
increases exponentially in the number of vertices. Even for sparse graphs the algorithm
becomes infeasible as the number of vertices increases. Besides the computational burden,
the determination of higher order conditional independence relations from the sample distribution
is generally less reliable than the determination of lower order independence relations. To address
the intractability issue, the PC algorithm (Kalisch & Buhlmann, 2007) has been proposed. Since
it is enough to find one Sij making Xi and Xj independent in order to remove their connection, the PC
algorithm performs the tests in a particular order. The revised step is as follows:

1. If (i , j) are not adjacent, then i and j are d-separated given either pa(i) or pa(j):

– If all edges between a vertex k and i , and between k and j , have already been removed, then sets S∗ with k ∈ S∗ need not be considered in the search for a separating set Sij .

– Restrict the search for separating sets S such that either S ⊆ Adj(Xi) or S ⊆ Adj(Xj), where Adj(Xv) refers to the set of vertices in the graph that are adjacent to v .

2. For each pair of variables Xi and Xj , test whether Xi ⊥ Xj ; if so, remove their edge. For each pair of variables Xi and Xj which are adjacent in the graph with max{|Adj(Xi)|, |Adj(Xj)|} ≥ 2, test Xi ⊥ Xj |S , where |S | = 1 and S ⊆ Adj(Xi) or S ⊆ Adj(Xj). Continuing in this way, for each pair of variables Xi and Xj which are adjacent with max{|Adj(Xi)|, |Adj(Xj)|} ≥ k + 1, test Xi ⊥ Xj |S , where |S | = k and S ⊆ Adj(Xi) or S ⊆ Adj(Xj). Stop when we reach a k such that for all (Xi ,Xj), max{|Adj(Xi)|, |Adj(Xj)|} < k + 1.
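The sketch below illustrates the flavor of the PC skeleton phase for Gaussian data, under simplifying assumptions: Fisher-z tests of partial correlation stand in for a generic conditional independence test, edge orientation is omitted, and the function names and significance level are our own.

```python
import itertools
import numpy as np
from scipy import stats

def partial_corr(R, i, j, S):
    """Partial correlation of i and j given the set S, obtained by inverting
    the corresponding submatrix of the correlation matrix R."""
    idx = [i, j] + list(S)
    P = np.linalg.inv(R[np.ix_(idx, idx)])
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def pc_skeleton(X, alpha=0.01):
    """Skeleton phase of a PC-style search: for k = 0, 1, 2, ..., test
    Xi _||_ Xj | S with |S| = k and S drawn from the current adjacency
    sets, deleting an edge as soon as one separating set is found."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    adj = {i: set(range(p)) - {i} for i in range(p)}
    k = 0
    while any(len(adj[i]) - 1 >= k for i in range(p)):
        for i, j in itertools.combinations(range(p), 2):
            if j not in adj[i]:
                continue
            for S in itertools.combinations(sorted(adj[i] - {j}), k):
                r = np.clip(partial_corr(R, i, j, S), -0.9999, 0.9999)
                z = 0.5 * np.log((1 + r) / (1 - r))          # Fisher's z
                stat = np.sqrt(n - len(S) - 3) * abs(z)
                pval = 2 * (1 - stats.norm.cdf(stat))
                if pval > alpha:                 # cannot reject independence
                    adj[i].discard(j)
                    adj[j].discard(i)
                    break
        k += 1
    return adj
```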
Other constraint-based algorithms include the grow-shrink (GS) algorithm (Margaritis
& Thrun, 1999) and incremental association (Tsamardinos et al., 2006). All these methods
were originally developed for low dimensional problems (Aliferis et al., 2010). Also, they may
involve conditional tests with the size of the conditioning set close to p, which cannot be
carried out, or are very unreliable, when p is greater than n. It is remarkable that under a sparsity
assumption which bounds the neighborhood size of each node, the PC algorithm has been
shown by Kalisch & Buhlmann (2007) to be consistent and to execute in time polynomial
in p. Therefore, the PC algorithm has been considered in the literature as the state-of-the-art
method for learning high-dimensional Bayesian networks. Recent applications and extensions of
the algorithm can be found in Colombo et al. (2012), Verma & Pearl (1991), Harris & Drton
(2013), McGeachie et al. (2014), Ha et al. (2015), among others.
1.3.3 Score-based Approaches
The score-based algorithms view the problem of learning a Bayesian network as a model
selection problem. They assign each candidate network structure a score function which
measures how well the model fits the observed data. The problem then becomes how to find
the highest-scoring network structure.
The scoring function can be entropy (Herskovits & Cooper, 2013), minimum description
length (Lam & Bacchus, 1994) or Bayesian scores (Heckerman et al., 1995). Under
appropriate conditions, the score-based methods can also be shown to be consistent; see
Chickering (2002) and Preetam et al. (2016) for the low and high dimensional cases, respectively.
Given the graph structure G and complete data D, define

Score(G,D) = P(G|D),
which is the posterior probability of G given the data set. By Bayes' law,

Score(G,D) = P(G|D) = P(D|G)P(G)/P(D).

We only need to maximize the numerator since the denominator does not depend on G. There
are several ways to calculate P(G). For simplicity, we ignore P(G), which is the same as
assuming a uniform prior on the structures.
We use θ to denote the parameters,

P(D|G) = ∫ P(D|G,θ)P(θ|G)dθ.

In the large sample limit, the term P(D|G,θ)P(θ|G) can be reasonably approximated as
a multivariate Gaussian. Given the maximum likelihood estimate θ̂ and ignoring terms
that do not depend on the data set size N, the BIC score approximation can be written as

BICscore(G,D) = logP(D|θ̂,G) − (d/2) log(N),

where d is the number of free parameters. The usefulness of the BIC score comes from
the fact that it does not depend on the prior over the parameters, which makes it popular in
practice where prior information is not available or is difficult to obtain.
Score-based algorithms attempt to optimize the score, returning the structure G that
maximizes it. This poses many problems since the space of all possible structures is at
least exponential in the number of variables p: there are p(p − 1)/2 possible undirected edges
and 2^{p(p−1)/2} possible structures for every subset of these edges; what is more, the direction
of each edge is undetermined. Therefore it is not possible to calculate the score for every
possible Bayesian network structure, and instead heuristic search algorithms are employed in
practice.
A simple greedy search type of algorithm is the hill-climbing (HC) algorithm.

Hill-Climbing Algorithm

1. Start with an initial graph structure G, e.g. the empty structure.

2. Repeat as long as Score(G) increases:

– apply an operation that yields an acyclic graph G∗: add, delete or reverse an arc of G;

– compute the score of the new graph, Score(G∗);

– if Score(G∗) > Score(G), set G = G∗ and Score(G) = Score(G∗).
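A minimal sketch of this greedy search is given below, assuming the data is a NumPy array; it reuses the bic_score sketch above and checks acyclicity with Kahn's algorithm. It accepts any improving single-arc move rather than the best one, which is one of several common variants.

```python
import itertools

def is_acyclic(parents, p):
    """Kahn's algorithm: repeatedly remove source nodes; the graph is a DAG
    iff every node gets removed."""
    indeg = {i: len(parents[i]) for i in range(p)}
    children = {i: [j for j in range(p) if i in parents[j]] for i in range(p)}
    stack = [i for i in range(p) if indeg[i] == 0]
    seen = 0
    while stack:
        u = stack.pop()
        seen += 1
        for v in children[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return seen == p

def hill_climb(data, score=bic_score):
    """Greedy hill climbing over DAGs: starting from the empty graph, apply
    single-arc additions, deletions and reversals as long as the score rises."""
    N, p = data.shape
    parents = {i: [] for i in range(p)}
    best = score(data, parents)
    improved = True
    while improved:
        improved = False
        for i, j in itertools.permutations(range(p), 2):
            for op in ("add", "delete", "reverse"):
                cand = {k: list(v) for k, v in parents.items()}
                if op == "add" and i not in cand[j]:
                    cand[j].append(i)            # add arc i -> j
                elif op == "delete" and i in cand[j]:
                    cand[j].remove(i)            # delete arc i -> j
                elif op == "reverse" and i in cand[j]:
                    cand[j].remove(i)            # reverse arc i -> j
                    cand[i].append(j)
                else:
                    continue
                if not is_acyclic(cand, p):
                    continue
                s = score(data, cand)
                if s > best:
                    best, parents, improved = s, cand, True
    return parents, best
```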
Another approach to finding the best score is simulated annealing, which considers
operators randomly. If an uphill step is induced, the system moves to the new state. If
a downhill step is induced, the system moves to the new state with a probability that
decreases with the reduction in score. The temperature is slowly lowered, so that the search
eventually concentrates around the global maximum.

Unfortunately, the task of finding a network structure that optimizes the scoring function
is NP-hard (Chickering, 1996), and the search process often stops at a locally optimal structure.
1.3.4 Hybrid Approaches
Hybrid approaches combine the constraint-based and score-based techniques to offset their
respective weaknesses. Both the sparse candidate algorithm (Friedman et al., 2008) and the
max-min hill-climbing (MMHC) algorithm (Tsamardinos et al., 2006) belong to this category.
The idea of these algorithms is that if we directly apply the greedy HC algorithm, the search
space can be huge, so there is a need for methods that increase the chances of building
a good quality model without exploring the whole search space exhaustively. One possible
approach is to use a less computationally expensive method to determine a promising subset of
the search space, on which we can subsequently apply a more systematic and costly method.
The MMHC algorithm combines a constraint-based method, the max-min parents-children (MMPC) algorithm
(Tsamardinos et al., 2003a), and a score-based algorithm, the HC algorithm. The algorithm first
identifies the parents and children set of each variable, then performs a greedy hill-climbing
search in the space of Bayesian networks. The search begins with an empty structure and
then adds, deletes or reverses an edge, whichever leads to an increase of the score. The important
difference between the MMHC algorithm and the standard HC algorithm is that the search is constrained
to consider adding an edge only if it was discovered by MMPC in the first phase.
Not every probability distribution can be faithfully represented by a DAG, and the typical
structure-learning algorithms can only deal with a restricted range of data sets. Faithfulness of
the distribution guarantees the existence of a DAG. Faithfulness along with the Markov property
indicates that there is a one-to-one mapping between the graphical criterion of d-separation
and conditional independence in the data. In practice, existing score-based, constraint-based
and hybrid algorithms deal primarily with discrete data sets. It is known that score-based
algorithms for continuous variables are computationally expensive. The GS algorithm proposed by
Magrassi et al. (2005) adopted a distribution-free test of conditional independence; however,
it is computationally expensive and cannot be readily used with the current constraint-based
algorithms for all but small networks.
Variable selection is a commonly used method to reduce the number of variables for
building more robust models. The central premise when using a variable selection technique
is that the data contain many variables that are either redundant or irrelevant, and can
thus be removed without incurring much loss of information. Variable selection and causal
structure learning share one concept: the Markov blanket of a variable X is the smallest set
which contains all variables carrying information about X that cannot be obtained from other
variables. The Markov blanket in a causal graph includes the set of parents, children and
spouses. In variable selection, we call variables carrying information about the target that cannot
be obtained from other variables strongly relevant variables. The variable selection process
and the causal graph construction process are thus similar in that both amount to a Markov blanket
identification process. It has been shown that the Markov blanket of a variable X is exactly the set
of strongly relevant variables, and that it is unique for faithful distributions (Tsamardinos et al.,
2003a).
1.4 ψ-learning Algorithm for Learning Gaussian Graphical Models
This section provides a brief review of the ψ-learning algorithm (Liang et al., 2015) for
learning Gaussian graphical models.

During the past decade, the Gaussian graphical model (GGM), as a special case of Markov
networks, has been widely studied. The idea of learning a Gaussian graphical model is to use
the partial correlation coefficient: a zero partial correlation coefficient indicates conditional
independence of the two variables. There also exists another way to measure dependency,
based on the correlation coefficient; however, the latter is less powerful due to
the fact that all variables in a system are more or less correlated. A variety of methods have
been proposed for constructing Gaussian graphical models from observed data. A popular
method is covariance selection (Dempster, 1972), which identifies the nonzero elements in
the concentration matrix (i.e., the inverse of the covariance matrix), because the nonzero entries in the
concentration matrix correspond to the conditionally dependent pairs of variables. Furthermore, Lauritzen
(1996) showed that the partial correlation coefficient between X (i) and X (j) given all other
variables can be expressed as

ρij |V\{i,j} = −Ci ,j /√(Ci ,iCj ,j), i , j = 1, ..., p, (1–2)

where Ci ,j denotes the (i , j)-entry of the concentration matrix, and V = {1, 2, ..., p} denotes the
set of indices of all variables of the system. However, this approach cannot be applied in the case
of p > n, where the sample covariance matrix is singular and thus the concentration matrix
can no longer be directly estimated. To tackle this difficulty, regularization methods such
as nodewise regression (Meinshausen & Buhlmann, 2006) and graphical Lasso (Yuan & Lin,
2007; Friedman et al., 2008; Danaher et al., 2014) have been proposed. Nodewise regression
uses Lasso (Tibshirani, 1996) as a variable selection method to identify the neighborhood
of each variable, which corresponds to the nonzero elements of the concentration matrix.
A neighborhood is the set of predictor variables with nonzero coefficients in a regression
model estimated separately for each variable. Meinshausen & Buhlmann (2006) showed that
this method asymptotically recovers the true graph. To avoid estimating a large number of
regressions, Yuan & Lin (2007) proposed to directly estimate the concentration matrix using
a regularization method with an l1-penalty. The method was then accelerated by Friedman
et al. (2008) using a coordinate descent algorithm that was originally designed for Lasso
regression, and this led to the so-called graphical Lasso algorithm.
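Assuming n > p so that the sample covariance matrix is invertible, equation (1–2) can be turned into code directly; the sketch below computes the full matrix of partial correlation coefficients from the estimated concentration matrix, with a small synthetic chain example for illustration.

```python
import numpy as np

def partial_correlations(X):
    """Full-order partial correlations via (1-2): invert the sample covariance
    to get the concentration matrix C, then rho_{ij|rest} = -C_ij / sqrt(C_ii C_jj).
    Requires n > p so that the sample covariance matrix is invertible."""
    C = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(C))
    rho = -C / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

# Example: for data generated from the chain X1 -> X2 -> X3, the partial
# correlation of X1 and X3 given X2 should be near zero.
rng = np.random.default_rng(1)
x1 = rng.normal(size=5000)
x2 = x1 + rng.normal(size=5000)
x3 = x2 + rng.normal(size=5000)
print(partial_correlations(np.column_stack([x1, x2, x3]))[0, 2])
```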
Another popular method to learn Gaussian graphical models is based on limited order
partial correlations. An important algorithm belonging to this category is the PC algorithm
(Spirtes et al., 2000), which works iteratively: it starts with a full graph with edges between
all variables, and then, for each edge of the current graph, it searches for a subset Q such that
the two variables connected by the edge are conditionally independent given Q. If such a set Q
is found, then the corresponding edge is removed. Since the PC algorithm searches for the maximum
of a set of p-values, it can be very slow when p is large. Quite recently, Liang et al. (2015) proposed the
ψ-learning method, which works on an equivalent measure of partial correlation coefficients
calculated with reduced conditioning sets. Let ψij denote the equivalent measure of the partial
correlation coefficient ρij |V\{i,j}. They are equivalent in the sense that

ψij = 0 ⇐⇒ ρij |V\{i,j} = 0, (1–3)
provided that the GGM satisfies the Markov property and adjacency faithfulness condition.
The GGM can be represented by an undirected graph G = (V,E), where V, with a slight
abuse of notation, denotes the set of p vertices corresponding to the p variables X (1), ...,X (p),
and E = (eij) denotes the adjacency matrix. If two vertices i , j ∈ V form an edge, we say
that i and j are adjacent and set eij = 1. The boundary set of a vertex v ∈ V, denoted by bG(v),
is the set of vertices adjacent to v , that is, bG(v) = {j : evj = 1}. The boundary set is also
called the neighborhood. A path of length l > 0 from v0 to vl is a sequence v0, v1, ..., vl of distinct
vertices such that evk−1,vk = 1 for all k = 1, ..., l . The subset U ⊂ V is said to separate I ⊂ V
from J ⊂ V if for every i ∈ I and j ∈ J, all paths from i to j have at least one vertex in U. For
a pair of vertices i ̸= j with eij = 0, a set U ⊂ V is called an {i , j}-separator if it separates
i and j in G. Let Gij be a reduced graph of G with eij set to zero. Then both the
boundary sets bGij (i) and bGij (j) are {i , j}-separators in Gij .
Let XV denote a random vector indexed by V = {1, ..., p} with probability distribution
PV. Let A ⊂ V be a subset of V, and let PA be the marginal distribution associated with the
random vector indexed by A. For a triplet I, J,U ⊂ V, we use XI ⊥ XJ|XU to denote that XI is
conditionally independent of XJ given XU.
Let rij denote the correlation coefficient of variables X (i) and X (j). Let G = (V, E) denote
the correlation graph of X (1), ...,X (p), where E = (eij) is the adjacency matrix with eij = 1 if
|rij | > 0 and 0 otherwise. Let r̂ij denote the empirical correlation coefficient of X (i) and X (j),
let ri denote a threshold value for node i , and let Eri ,i = {v : |r̂iv | > ri} \ {i} denote a reduced neighborhood
of node i in the empirical correlation graph. For convenience, we define Erj ,j = {v : |r̂jv | > rj} \ {j},
Eri ,i ,−j = Eri ,i \ {j}, and Erj ,j ,−i = Erj ,j \ {i}. For any pair of vertices i and
j , we define the partial correlation coefficient ψij by

ψij = ρij |Sij , (1–4)

where Sij = Eri ,i ,−j if |Eri ,i ,−j | < |Erj ,j ,−i | and Sij = Erj ,j ,−i otherwise, and |D| denotes the
cardinality of the set D. To distinguish ψij from conventional partial correlation coefficients, we
call it the ψ-partial correlation.
Definition 1. (Markov property) We say that PV satisfies the Markov property with respect to
G if for every triplet of disjoint sets I, J,U ⊂ V, it holds that XI ⊥ XJ|XU whenever U separates I
and J in G.

Definition 2. (Adjacency faithfulness) We say that PV satisfies the adjacency faithfulness
condition with respect to G if, whenever two variables X (i) and X (j) are adjacent in G, they are
dependent conditioned on any subset of XV\{i ,j}.

The adjacency faithfulness condition implies that if there exists a subset U ⊆ V \ {i , j}
such that X (i) ⊥ X (j)|XU, then X (i) and X (j) are not adjacent in G.
Figure 1-2. Illustrative plot for the calculation of ψ-partial correlation coefficients, where the solid and dotted edges indicate the direct and indirect associations, respectively. The left and right shaded ellipses cover, respectively, the reduced neighborhoods of node i and node j in the correlation graph.
Furthermore, by the Markov property, we have

X (i) ⊥ X (j)|XU =⇒ X (i) ⊥ X (j)|XV\{i ,j} for any U ⊆ V \ {i , j}.

In particular, if U = ∅, we have

X (i) and X (j) are marginally independent =⇒ X (i) ⊥ X (j)|XV\{i ,j},

or, equivalently,

corr(X (i),X (j)) = 0 =⇒ ρij |V\{i ,j} = 0,
where corr(X (i),X (j)) denotes the correlation coefficient of X (i) and X (j), and ρij |V\{i ,j} denotes the partial
correlation coefficient of X (i) and X (j) conditioned on all other variables. Since the essence of
the Gaussian graphical model is to find the pairs of random variables for which the partial
correlation coefficient is equal to zero, a correlation screening procedure can be applied to
reduce the size of the conditioning set in calculating the partial correlation coefficient. Let ψij
denote the partial correlation coefficient calculated with the reduced conditioning set Sij , i.e.,
ψij = ρij |Sij . Under the Markov property and faithfulness condition, Liang et al. (2015) showed that
ψij is equivalent to ρij |V\{i ,j} in learning the structure of the Gaussian graphical model, in the
sense that

ψij = 0 ⇐⇒ ρij |V\{i ,j} = 0.

Further, under mild conditions on the sparsity of the underlying GGM, Liang et al. (2015)
showed that the size of Sij can be bounded by n/ log(n). Therefore, the ψ-learning algorithm
successfully reduces the problem of partial correlation coefficient calculation from a
high-dimensional setting to a low-dimensional one. Note that ρij |V\{i ,j} is not even computable
when p is larger than n. In summary, the ψ-learning algorithm consists of the following steps
to calculate the ψ-partial correlation coefficients:
Algorithm 1.1. (ψ-learning algorithm)

(a) (Correlation screening) Determine the reduced neighborhood for each variable Xi :

(i) Conduct a multiple hypothesis test to identify the pairs of variables for which the empirical correlation coefficient is significantly different from zero. This step results in a so-called empirical correlation network.

(ii) For each variable Xi , identify its neighborhood in the empirical correlation network, and reduce the size of the neighborhood to O(n/ log(n)) by removing the variables having lower correlation (in absolute value) with Xi . This step results in a so-called reduced correlation network.

(b) (ψ-calculation) For each pair of variables Xi and Xj , identify the separator Sij based on the reduced correlation network resulting from step (a), and calculate ψij = ρij |Sij , where ρij |Sij denotes the partial correlation coefficient of Xi and Xj calculated for the dataset X conditioned on the variables {Xl : l ∈ Sij}.

(c) (ψ-screening) Conduct a multiple hypothesis test to identify the pairs of vertices for which ψij is significantly different from zero, and set the corresponding elements of E to 1.
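A simplified sketch of steps (a) and (b) is given below. It replaces the multiple hypothesis test of step (a)(i) with a plain ranking of absolute correlations, keeps the top n/(ξn log n) neighbors per node, and computes ψij by inverting the correlation submatrix over {i, j} ∪ Sij; all names are ours.

```python
import numpy as np

def psi_partial_correlations(X, xi=1.0):
    """Sketch of steps (a)-(b) of Algorithm 1.1: screen each node's neighborhood
    down to at most n/(xi*log n) variables by absolute correlation (a plain
    ranking stands in for the multiple hypothesis test of step (a)(i)), then
    compute psi_ij as the partial correlation of i and j given the smaller
    reduced neighborhood S_ij."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    m = max(1, int(n / (xi * np.log(n))))
    # Reduced neighborhoods: the m most correlated variables per node
    # (position 0 of the ranking is the node itself and is skipped).
    nbr = {i: set(np.argsort(-np.abs(R[i]))[1:m + 1]) for i in range(p)}
    psi = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            Si, Sj = nbr[i] - {j}, nbr[j] - {i}
            S = sorted(Si) if len(Si) < len(Sj) else sorted(Sj)
            idx = [i, j] + S
            P = np.linalg.inv(R[np.ix_(idx, idx)])   # concentration submatrix
            psi[i, j] = psi[j, i] = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])
    return psi
```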
The bound on the neighborhood size is suggested to be set as n/[ξn log(n)], where ξn is a
tunable parameter with a default value of 1. For some problems, one may set ξn > 1, say
2 or 3; if n is too small, one may set ξn < 1, say 1/2 or 1/3, while ensuring that the condition
n/[ξn log(n)] < n − 4 holds. The ψ-learning algorithm is very convenient for incorporating
prior knowledge into network construction. For example, if we know that some pair of variables, say
Xi and Xk , are correlated, then we can always include Xk in the set Eri ,i and include Xi in
the set Erk ,k , even if the empirical correlation between Xi and Xk is not strong.
We apply Fisher’s transformation to rij to get
zij =1
2log
[1 + rij1− rij
],
which approximately follows a normal distribution with mean 0 and variance 1/(n − 3)
under the null hypothesis H0 : rij = 0. Based on this asymptotic result, we calculate p-value for
the test H0 : rij = 0↔ H1 : rij = 0, and then apply the probit transformation to the p-value to
get
zij = Φ−1(1− 2[1−Φ(
√n − 3|zij |)]) = Φ−1(2Φ(
√n − 3|zij |)− 1), (1–5)
where Φ(·) denotes the cumulative distribution function of the standard normal distribution.
For convenience, we call zij a correlation score. Due to the monotonicity of the transformation
(1–5) in |zij | it converts a double-sided test H0 : zij = 0 ↔ zij = 0 to a single test
H0 : zij = 0 ↔ H1 : zij > 0. Therefore, it can be used as a test statistic for identification of
non-zero correlation coefficients.
Similarly, we let ψ̂ij denote the empirical value of ψij . Applying Fisher's transformation to
ψ̂ij , we get

z ′ij = (1/2) log[(1 + ψ̂ij)/(1 − ψ̂ij)], (1–6)

which approximately follows a normal distribution with mean 0 and variance 1/(n − |Sij | − 3).
Then, we calculate the p-value for the corresponding test and apply the probit transformation
to the p-value to get

z̃ ′ij = Φ−1(2Φ(√(n − |Sij | − 3)|z ′ij |) − 1), (1–7)

which is called the ψ-score. Similarly, it can be used as a test statistic for the identification of
non-zero ψ-partial correlation coefficients and thus the structure of the Gaussian graphical model.
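The ψ-scores of (1–6)–(1–7) are straightforward to compute; the sketch below applies Fisher's transformation followed by the probit transform of the p-value. The clipping guards against numerical infinities only and is not part of the method as stated.

```python
import numpy as np
from scipy import stats

def psi_score(psi_hat, n, s_size):
    """psi-score via (1-6)-(1-7): Fisher's transformation of the empirical
    psi-partial correlation, then the probit transform of its p-value,
    giving a one-sided test statistic for nonzero coefficients."""
    z = 0.5 * np.log((1 + psi_hat) / (1 - psi_hat))            # eq. (1-6)
    stat = np.sqrt(n - s_size - 3) * np.abs(z)
    q = np.clip(2 * stats.norm.cdf(stat) - 1, 0.0, 1 - 1e-16)  # numerical guard
    return stats.norm.ppf(q)                                   # eq. (1-7)

# Example: a moderate psi-partial correlation with n = 100 and |S_ij| = 5.
print(psi_score(0.3, n=100, s_size=5))
```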
CHAPTER 2
UNDIRECTED GRAPHICAL MODEL FOR COUNT DATA
2.1 RNA-seq Data and Poisson Graphical Models
In recent years, next generation sequencing (NGS) has gradually replaced microarray
as the major platform in transcriptome studies, for example through sequencing RNAs (RNA-seq).
RNA-seq uses counts of reads to quantify gene expression levels. Compared to microarray data,
RNA-seq data have many advantages, such as providing digital rather than analog signals of
expression levels, a wider dynamic range of measurements, less noise, higher throughput,
etc. However, their discreteness also challenges the existing methods. In practice, RNA-seq
data are often modeled using the Poisson (Sultan et al., 2008) or negative-binomial distribution
(Anders & Huber, 2010; Robinson & Oshlack, 2010), but difficulties often arise in
computing or establishing the properties of statistics based on these distributions.
Let Y = (Y1, ...,Yp) denote a p-dimensional Poisson random vector associated with a
graphical model G. It is natural to assume that all the node-conditional distributions, that
is, the conditional distributions of one variable given all other variables, are Poisson, with the
distribution given by

P(Yj |Yk ,∀k ̸= j ;Θj) = exp[θjYj − log(Yj !) + ∑k ̸=j θjkYjYk − A(θj , θjk)], (2–1)

where Θj = {θj , θjk , k ̸= j}, and A(θj , θjk) is the log-partition function of the Poisson
distribution. Following from the Hammersley-Clifford theorem (Besag, 1974), the node-conditional
distributions combine to yield the joint Poisson distribution

P(Y;Θ) = exp[∑pj=1(θjYj − log(Yj !)) + ∑j ̸=k θjkYjYk − ϕ(Θ)], (2–2)

where Θ = (Θ1, ...,Θp) and ϕ(Θ) is the normalizing term ensuring the properness of this
distribution. However, the Poisson graphical model suffers from a major caveat: the interaction
parameters θjk must be nonpositive for all j ̸= k to ensure that ϕ(Θ) is finite and thus that the
distribution P(Y;Θ) is proper (Besag, 1974; Yang et al., 2012). Therefore, the Poisson
graphical model only permits negative conditional dependencies, which is a severe limitation
in practice. As shown in Patil et al. (1968), the negative binomial graphical model also suffers
from the same limitation.
To relax this limitation, Allen & Liu (2013) proposed a local Poisson graphical model
(LPGM), which ignores the joint distribution of the Yj 's and works by finding a local model
for each gene using a regularization method based on the conditional distribution (2–1), and
then defining the network structure as the union of the local models. To account for the
high dispersion of NGS data, where the inter-sample variance is greater than the sample
mean, Gallopin et al. (2013) proposed a hierarchical log-normal Poisson model which assumes
Yij ∼ Poisson(λij) with log(λij) = ∑k ̸=j βjkyik + ϵij for i = 1, ..., n, where ϵij is a Gaussian
random variable and yik denotes the standardized, log-transformed data. For each variable Yj , the
local model can be found via a regularization approach for the log-normal Poisson regression.
Quite a few related models have been proposed along this direction, including the truncated
PGM, quadratic PGM, sub-linear PGM and square-root PGM; refer to Yang et al. (2012) and
Inouye et al. (2016) for the details. However, these LPGM-based methods are not consistent
due to their ignorance of the joint distribution of the Yj 's. Without the joint distribution, the
conditional dependence Yk ⊥ Yj |YV\{k,j} is not well defined, and therefore the theoretical
basis Yk ⊥ Yj |YV\{k,j} ⇐⇒ θkj = 0 and θjk = 0 of the nodewise regression (Meinshausen
& Buhlmann, 2006; Ravikumar et al., 2010) does not hold, where θkj and θjk are defined in
equation (2–1). Hence, linking the Poisson graphical model to nodewise Poisson regression will
not lead to a consistent estimate of the underlying network.
We propose a random effect model-based transformation for RNA-seq data, which
converts the count data to continuous data; the continuous data can be further transformed to
Gaussian data via the semiparametric transformation described in Liu et al. (2009). Then, we
adopt the ψ-learning method developed in Liang et al. (2015) to construct Gaussian graphical
models (GGMs) for the transformed data. Under mild regularity and sparsity conditions, we
show that the proposed method is consistent. Transforming count data to continuous data
greatly facilitates the analysis of NGS data.
The remainder of this chapter is organized as follows. Section 2.2 describes the
random effect model-based transformation and gives a brief review of the semiparametric
transformation of Liu et al. (2009) and the ψ-learning method of Liang et al. (2015). Section 2.3
illustrates the proposed method using simulated data, along with comparisons with gLasso,
nodewise regression, LPGM, and some other existing methods. Section 2.4 presents two real
data examples, and Section 2.5 concludes with a discussion.
2.2 Method
The proposed method consists of three steps: (i) data-continuized transformation, (ii)
data-Gaussianized transformation, and (iii) ψ-learning, which are described in turn below.
2.2.1 Data-Continuized Transformation
To continuize the RNA-seq data, we propose a random effect model-based transformation.
Let Yij denote the RNA-seq expression level of gene i in subject j, where i = 1, ..., p indexes
the genes and j = 1, ..., n indexes the subjects. We assume that
Yij ∼ Poisson(θij), θij ∼ Gamma(αi , βi), (2–3)
where αi and βi are the shape and rate parameters of the Gamma distribution, respectively.
It is easy to see that (2-3) forms a random effect model with the gene-specific random effect
modeled by a Gamma distribution. If we integrate θij out of the joint distribution
f(yij, θij | αi, βi), then Yij follows a negative binomial distribution NB(r, q) with size r = αi and
success probability q = βi/(1 + βi). Hence, the model (2-3) is quite flexible and accommodates
potential overdispersion of the data.
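As a quick numerical check of this mixture representation, one can verify in R that Poisson draws with Gamma-distributed rates match the stated negative binomial marginal; the sketch below assumes the shape-rate parameterization implied by (2-4), and the particular values of alpha and beta are illustrative.

```r
## Sanity check for (2-3): Poisson-Gamma mixture vs. its NB marginal,
## under the shape-rate parameterization (an assumption stated above).
alpha <- 2; beta <- 0.5
y1 <- rpois(1e5, rgamma(1e5, shape = alpha, rate = beta))        # mixture draws
y2 <- rnbinom(1e5, size = alpha, prob = beta / (1 + beta))       # NB marginal
c(mean(y1), mean(y2))   # both should be close to alpha / beta = 4
c(var(y1), var(y2))     # both should be close to alpha * (1 + beta) / beta^2 = 12
```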
To avoid an explicit specification for the values of αi and βi , we conduct a Bayesian
analysis for the model. For this purpose, we let αi and βi be subject to the prior distributions:
αi ∼ Gamma(a1, b1), βi ∼ Gamma(a2, b2).
where a1, b1, a2 and b2 are prior hyperparameters. By the assumption that αi and βi are
a priori independent, the full conditional posterior distributions of αi, βi and θij are given as
follows:

$$
\begin{aligned}
f(\alpha_i \mid \beta_i, \{\theta_{ij}\}_{j=1}^{n}, y_i) &\propto \frac{\alpha_i^{a_1-1}}{\Gamma^n(\alpha_i)}\, e^{\alpha_i\left(-b_1 + n\log\beta_i + \sum_{j=1}^{n}\log\theta_{ij}\right)}, \\
f(\beta_i \mid \alpha_i, \{\theta_{ij}\}_{j=1}^{n}, y_i) &\propto \beta_i^{n\alpha_i + a_2 - 1}\, e^{-\beta_i\left(\sum_{j=1}^{n}\theta_{ij} + b_2\right)} = \mathrm{Gamma}\Big(n\alpha_i + a_2,\ \sum_{j=1}^{n}\theta_{ij} + b_2\Big), \\
f(\theta_{ij} \mid \alpha_i, \beta_i, y_i) &\propto \theta_{ij}^{\,y_{ij}+\alpha_i-1}\, e^{-\theta_{ij}(1+\beta_i)},
\end{aligned} \tag{2-4}
$$

where yi = {yij : j = 1, 2, ..., n}. Regarding the choice of prior hyperparameters, we establish
the following lemma, whose proof is given in the Appendix.
Lemma 1. If a1 and a2 take small positive values, then for all i and j, the posterior mean of
θij, denoted by E[θij | yi], will converge to yij as b1 → ∞ and b2 → ∞.
Suppose that an MCMC algorithm, for example the Metropolis-within-Gibbs sampler (Muller,
1992), is used to simulate from the posterior distribution (2-4). Let θij^(t) denote the posterior
sample of θij at iteration t = 1, 2, ..., and let θ̄ij^(T) = Σ_{t=1}^T θij^(t)/T denote the Monte Carlo estimator
of E[θij | yi]. Then, following from the standard theory of MCMC, we have θ̄ij^(T) →p E[θij | yi] as
T → ∞, where →p denotes convergence in probability. To ensure that the convergence θ̄ij^(T) →p yij
holds in a rigorous manner, the iteration number T and the prior hyperparameters b1 and
b2 need to go to infinity simultaneously. To achieve this goal, we let b1^(t) and b2^(t) denote the
respective values of b1 and b2 at iteration t, and we set

$$b_1^{(t)} = b_1^{(t-1)} + \frac{c}{t^{\zeta}}, \qquad b_2^{(t)} = b_2^{(t-1)} + \frac{c}{t^{\zeta}}, \qquad t = 1, 2, \dots, \tag{2-5}$$

where b1^(0) and b2^(0) are fixed large constants, c > 0 is a small constant, and 0 < ζ ≤ 1. Under
this setting, the MCMC sampler for (2-4) forms an adaptive Markov chain whose target
distribution gradually shrinks toward a Dirac delta measure defined on (αi, βi, θij) = (0, 0, yij).
For simplicity in the theoretical development (see Appendix A), we assume that a random walk
proposal is used in simulating from the conditional posterior distribution f(αi | ·); that is, the
proposal distribution q(αi′ | αi^(t)) = q(|αi′ − αi^(t)|) depends on |αi′ − αi^(t)| only. In summary, we
have the following lemma, whose proof is given in the Appendix.

Lemma 2. If a random walk proposal is used in simulating from f(αi | ·) and the prior
hyperparameters are chosen as in (2-5), then θ̄ij^(T) →p yij for all i and j as T → ∞, where
θ̄ij^(T) = Σ_{t=1}^T θij^(t)/T and θij^(t) denotes the posterior sample of θij generated at iteration t.
Lemma 2 implies that statistical inference for the yij's can be approximately made using
the θ̄ij^(T)'s as T → ∞. The validity of the approximation can be argued as follows. Let
F̂_{y1,...,yp}(t) denote the empirical CDF of (Y1, ..., Yp). It is easy to see that the convergence
θ̄ij^(T) →p yij implies that sup_{t∈R^p} |F̂_{θ̄1^(T),...,θ̄p^(T)}(t) − F̂_{y1,...,yp}(t)| →p 0 as T → ∞. Further, as the sample
size n → ∞, sup_{t∈R^p} |F̂_{y1,...,yp}(t) − F_{Y1,...,Yp}(t)| →a.s. 0 holds under some regularity and
sparsity conditions, where F_{Y1,...,Yp}(t) denotes the joint CDF of (Y1, ..., Yp) and →a.s. denotes
almost sure convergence. For example, we can assume that for each Yi, the number of variables
that Yi depends on is upper bounded by n/log n. In summary, we have
sup_{t∈R^p} |F̂_{θ̄1^(T),...,θ̄p^(T)}(t) − F_{Y1,...,Yp}(t)| →p 0 as T → ∞, which implies that a consistent estimate,
based on the continuized data, can be formed for each conditional probability used in inference of the
network structure underlying Y1, ..., Yp. That is, the conditional independence relations among
Y1, ..., Yp can be learned from the continuized data θ̄1^(T), ..., θ̄p^(T) in a consistent manner.
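To make the transformation concrete, the following R code gives a minimal sketch of the adaptive Metropolis-within-Gibbs sampler for a single gene under (2-4) and (2-5); the function name `continuize` and the random walk step size `sd.rw` are illustrative choices, not part of the original development.

```r
## A minimal sketch of the data-continuized transformation for one gene,
## assuming the shape-rate parameterization of (2-3)-(2-5).
continuize <- function(y, T = 10000, burn = 1000,
                       a1 = 1, a2 = 1, b1 = 1e4, b2 = 1e4,
                       c = 1, zeta = 1, sd.rw = 0.1) {
  n <- length(y)
  alpha <- 1; beta <- 1
  theta.sum <- rep(0, n)
  for (t in 1:T) {
    b1 <- b1 + c / t^zeta                 # adaptive prior schedule (2-5)
    b2 <- b2 + c / t^zeta
    ## Gibbs update: theta_ij ~ Gamma(y_ij + alpha, 1 + beta), from (2-4)
    theta <- rgamma(n, shape = y + alpha, rate = 1 + beta)
    ## Gibbs update: beta_i ~ Gamma(n * alpha + a2, sum(theta) + b2)
    beta <- rgamma(1, shape = n * alpha + a2, rate = sum(theta) + b2)
    ## Random-walk Metropolis update for alpha_i, targeting f(alpha_i | .)
    log.post <- function(a)
      (a1 - 1) * log(a) - n * lgamma(a) +
        a * (-b1 + n * log(beta) + sum(log(theta)))
    prop <- alpha + rnorm(1, 0, sd.rw)
    if (prop > 0 && log(runif(1)) < log.post(prop) - log.post(alpha))
      alpha <- prop
    if (t > burn) theta.sum <- theta.sum + theta
  }
  theta.sum / (T - burn)                  # posterior-mean estimate of theta_ij
}
```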
2.2.2 Data Gaussianized Transformation
Since GGMs have been extensively studied, we seek a transformation that transforms
the continuized data to Gaussian data while maintaining the conditional independence relations
among the variables. The semiparametric Gaussian copula transformation, the so-called
nonparanormal transformation, proposed by Liu et al. (2009), satisfies this requirement. It can
be described as follows.
Let X = (X1, ..., Xp)^T be a continuous p-dimensional random vector. X is said to have a
nonparanormal distribution if there exist functions {fj}_{j=1}^p such that Z = f(X) ∼ N(μ, Σ),
where f(X) = (f1(X1), ..., fp(Xp))^T. We write X ∼ NPN(μ, Σ, f). It is known that
if the fj's are monotone and differentiable, the joint probability density function of X is given by

$$p_X(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\Big\{-\frac{1}{2}\big(f(x)-\mu\big)^T \Sigma^{-1} \big(f(x)-\mu\big)\Big\} \prod_{j=1}^{p} |f_j'(x_j)|. \tag{2-6}$$
Based on this formula, Liu et al. (2009) argued that if X ∼ NPN(μ, Σ, f) and each fj is
monotone and differentiable, then Xi ⊥ Xj | X_{V\{i,j}} ⟺ Zi ⊥ Zj | Z_{V\{i,j}}. By a similar
argument, we have that for any triplet of disjoint sets A, B, C ⊆ V, XA ⊥ XB | XC ⟺
ZA ⊥ ZB | ZC. In other words, the nonparanormal transformation preserves the conditional
independence structure of the original graphical model formed by X. Liu et al. (2009) further
showed that fj(x) = μj + σj Φ^{-1}(Fj(x)) is such a monotone and differentiable transformation,
where μj is the mean of Xj, σj² is the variance of Xj, and Fj(x) is the CDF of Xj. For the
high-dimensional case, where p is greater than n and can increase with n, Fj(x) can be replaced by
a truncated, or Winsorized, estimator of the marginal empirical distribution of Xj in order to
reduce the variance of the estimate.
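The following R code is a minimal sketch of this transformation with μj = 0 and σj = 1; the truncation level δn = 1/(4 n^{1/4} √(π log n)) is the level commonly used for the Winsorized estimator and is an assumption here, as is the function name `npn.transform`.

```r
## A minimal sketch of the nonparanormal transformation of Liu et al. (2009)
## with a truncated (Winsorized) empirical CDF.
npn.transform <- function(X) {             # X: n x p data matrix
  n <- nrow(X)
  delta <- 1 / (4 * n^0.25 * sqrt(pi * log(n)))   # assumed truncation level
  apply(X, 2, function(x) {
    F.hat <- ecdf(x)(x) * n / (n + 1)      # rescale to keep F.hat < 1
    F.hat <- pmin(pmax(F.hat, delta), 1 - delta)  # Winsorize the tails
    qnorm(F.hat)                           # f_j(x) = Phi^{-1}(F_j(x))
  })
}
```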
As shown in Liang et al. (2015), the ψ-learning method is consistent; that is, the network
produced by it converges to the true one as the sample size n → ∞.

The multiple hypothesis tests involved in the correlation screening and ψ-screening steps
can be done using an empirical Bayes method developed in Liang & Zhang (2008). The
advantage of this method is that it allows for general dependence between test statistics.
Other multiple hypothesis testing methods that account for the dependence between test
statistics, for example Benjamini et al. (2006), can also be applied here. The performance of
the multiple hypothesis tests depends on their significance levels. Following the suggestions of
Liang et al. (2015), we set the significance level of correlation screening to α1 = 0.2 and that
of ψ-screening to α2 = 0.05. In general, a high significance level for correlation screening leads
to a slightly larger separator set Sij, which reduces the risk of missing important variables
in the conditioning set; including a few false variables in the conditioning set does not hurt
the accuracy of the ψ-partial correlation coefficients much.
2.2.3 Consistency
In summary, the proposed method consists of three steps: (i) data-continuized transformation,
(ii) data-Gaussianized transformation, and (iii) ψ-learning for GGMs. From Lemma 2 and
the arguments following it, we can conclude that the network structure of Y1, ..., Yp can be
consistently learned from the continuized data θ̄1^(T), ..., θ̄p^(T). Liu et al. (2009) showed that
the data-Gaussianized transformation preserves the network structure underlying the data,
and Liang et al. (2015) showed that the ψ-learning method is consistent in recovering the
underlying network structure. Therefore, the consistency holds for the proposed method;
that is, the true gene regulatory relations can be recovered from the RNA-seq data using the
proposed method when the sample size becomes large.
2.3 Simulation Studies
To illustrate the performance of the proposed method, we consider some simulation
examples with known conditional independence structures. Since most NGS data tend
to be zero-inflated and highly over-dispersed, the data were simulated from a multivariate
zero-inflated negative binomial (ZINB) distribution. The ZINB distribution contains three
parameters, λ, κ and ω, which control its mean, dispersion and degree of zero-inflation,
respectively. The algorithm developed by Yahav & Shmueli (2012) was adopted to simulate the
data; it works via an inverse nonparanormal transformation as follows:
(a) Simulate a random sample of n multivariate Gaussian random vectors with a known
concentration matrix. Denote the random sample by (X1, ..., Xp), where each variable
Xi = (Xi1, ..., Xin)^T consists of n realizations.

(b) For each variable Xi, find the empirical CDF based on the n realizations and calculate the
cumulative probability value for each realization Xij.

(c) Generate a random sample of n zero-inflated negative binomial random variables with
pre-specified parameters λ, κ and ω by inverting the cumulative probability values
obtained in (b).
In our simulations, we set the concentration matrix as follows:

$$C_{ij} = \begin{cases} 0.5, & \text{if } |j-i| = 1,\ i = 2, \dots, (p-1), \\ 0.25, & \text{if } |j-i| = 2,\ i = 3, \dots, (p-2), \\ 1, & \text{if } j = i,\ i = 1, \dots, p, \\ 0, & \text{otherwise}. \end{cases} \tag{2-7}$$
This matrix has been used by quite a few authors to demonstrate their GGM algorithms,
e.g., Yuan & Lin (2007), Mazumder & Hastie (2012), and Liang et al. (2015). To make the
simulation realistic, we set the parameters λ, κ and ω of the ZINB distribution
to their estimates from a real dataset, the acute myeloid leukemia (AML) mRNA sequencing
data, which is available on The Cancer Genome Atlas (TCGA) data portal. We estimated
these parameters for each gene using the function glm.nb in R, and then set the simulation
parameters to the medians of the estimates: λ = 515,743, κ = 3.304 and ω = 0.003. For
the other parameters, we set n = 100 and p = 200. We then applied the proposed method
to the simulated data, which went through the steps of data-continuized transformation,
nonparanormal transformation, and ψ-learning. To measure the performance of the method,
we plot the precision-recall curve in Figure 2-2, which is drawn by fixing the significance level
of correlation screening at α1 = 0.2 and varying α2, the significance level of ψ-screening.
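For concreteness, the following R code sketches steps (a)-(c) under the concentration matrix (2-7) and the ZINB parameters stated above. The helper `qzinb` is our own construction from `qnbinom`, with κ interpreted as the negative binomial size parameter; both are assumptions, not part of the original algorithm description.

```r
## A minimal sketch of Yahav and Shmueli's inverse-nonparanormal generator.
library(MASS)                                   # for mvrnorm
qzinb <- function(u, lambda, kappa, omega)      # assumed ZINB quantile helper
  ifelse(u <= omega, 0,
         qnbinom((u - omega) / (1 - omega), mu = lambda, size = kappa))
n <- 100; p <- 200
C <- diag(p)                                    # concentration matrix (2-7)
C[abs(row(C) - col(C)) == 1] <- 0.5
C[abs(row(C) - col(C)) == 2] <- 0.25
X <- mvrnorm(n, mu = rep(0, p), Sigma = solve(C))          # step (a)
U <- apply(X, 2, function(x) ecdf(x)(x) * n / (n + 1))     # step (b)
Y <- apply(U, 2, qzinb, lambda = 515743, kappa = 3.304, omega = 0.003)  # step (c)
```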
To conduct the data-continuized transformation, the Metropolis-within-Gibbs sampler
was run for 10,000 iterations for this dataset, where the first 1,000 iterations were discarded
as burn-in and the remaining iterations were used for inference. The total CPU
time cost by the sampler was 39.0 seconds on a personal computer with a 2.8 GHz Intel Core i7;
on average, it cost less than 0.2 seconds per variable. For this transformation, we set a1 = a2 = 1,
b1^(0) = b2^(0) = 10,000, c = 1, and ζ = 1, the default setting of the prior hyperparameters used
throughout this chapter. The left panel of Figure 2-1 shows the scatter plot of the continuized
data versus the raw counts for one variable, and the right panel shows the Q-Q plot of the
Gaussianized data for the same variable. The scatter plot indicates that the continuized data
and the raw counts are very close to each other. To explore the data-continuized
transformation thoroughly, we report in Table 2-1 the posterior means and standard
deviations of αi, βi and θij, together with the AUC value, that is, the area under the
precision-recall curve, for measuring the performance of the proposed method. The results
indicate again that θij can be very close to yij and that our method is robust to the choice of
(a1, a2, b1^(0), b2^(0)): the data-continuized transformation does not lose much information
about the raw counts.
For comparison, we applied the existing methods, including gLasso, nodewise
regression, the local Poisson graphical model (LPGM), the truncated Poisson graphical model
(TPGM) and the sublinear Poisson graphical model (SPGM), to the simulated data. For
gLasso and nodewise regression, the simulated ZINB data first went through the logarithm
transformation and the nonparanormal transformation, which have been widely used in RNA-seq
data analysis, and then the methods were applied. The gLasso and nodewise regression
methods are implemented in the R package huge (Zhao et al., 2015). In our application,
the stability approach was used to determine their regularization parameters; it selects the
network with the smallest amount of regularization that simultaneously makes the network
sparse and replicable under random sampling. For LPGM, we used the method proposed by
Allen & Liu (2013); for SPGM and TPGM, we used the methods proposed by Yang et al. (2013).
These three methods are implemented in the R package XMRF (Wan et al., 2015). Besides
these existing methods, we also compared the proposed method with the one without the
data-continuization step, that is, ψ-learning with the logarithm and nonparanormal
transformations, which is labeled "Log+NPN+ψ-learning" in Figure 2-2.
Figure 2-1. Left: scatter plot of the continuized data versus raw counts for one variable. Right: Q-Q plot of the Gaussianized data for one continuized variable.

Figure 2-2. Precision-recall curves produced by the proposed method (Cont+NPN+ψ-learning), log-transformation-based ψ-learning (Log+NPN+ψ-learning), log-transformation-based gLasso (Log+NPN+gLasso), log-transformation-based nodewise regression (Log+NPN+nodewise regression), LPGM, SPGM and TPGM for the simulated data with (n, p) = (100, 200).
Table 2-1. The posterior means and standard deviations of αi, βi and θij for one simulated variable, where a1 = a2 = a and b1^(0) = b2^(0) = b^(0).

a      b^(0)   Yij              θij              αi                        βi                        AUC
1      10^4    513.37 (284.47)  513.27 (284.38)  3.01×10^-7 (2.04×10^-6)   6.58×10^-6 (6.64×10^-6)   0.940
1      10^6    513.32 (284.41)  513.32 (284.41)  8.58×10^-7 (5.99×10^-6)   9.46×10^-7 (9.52×10^-7)   0.941
1      10^10   513.37 (284.47)  513.37 (284.47)  7.47×10^-7 (5.41×10^-6)   9.87×10^-11 (9.71×10^-11) 0.943
0.001  10^4    513.44 (284.43)  513.44 (284.43)  6.54×10^-7 (5.03×10^-6)   1.58×10^-8 (5.10×10^-7)   0.941
0.001  10^6    513.51 (284.45)  513.51 (284.45)  3.78×10^-7 (2.15×10^-6)   1.15×10^-9 (2.87×10^-8)   0.941
0.001  10^10   513.37 (284.48)  513.37 (284.48)  5.75×10^-7 (3.56×10^-6)   6.24×10^-14 (1.72×10^-12) 0.942
The comparison indicates that the proposed method significantly outperforms the
other existing methods, although the improvement mainly comes from ψ-learning. The
data-continuized transformation does not lose information in the data, and it provides a
justification for the empirical practice of treating log-transformed NGS data as continuous.
Multiple datasets have been tried, and the results are very similar. Note that LPGM is an
extension of the nodewise regression method (Meinshausen & Buhlmann, 2006) to the
multivariate Poisson setting; both LPGM and nodewise regression are based on the idea of
neighborhood selection. This experiment also shows that the data-continuized and
nonparanormal transformations improve the performance of neighborhood selection methods.
Based on this experiment, we suspect that the graph consistency established in Meinshausen
& Buhlmann (2006) for nodewise normal regression might not hold for LPGM.
We have also considered several common network structures, such as hub, scale-free,
small-world and random. Multivariate Gaussian random variables given these structures
can be generated by functions provided in the huge package. We then applied steps (b) and
(c) of Yahav and Shmueli's algorithm to obtain ZINB samples with the same parameters as used
before, that is, (n, p) = (100, 200), λ = 515,743, κ = 3.304 and ω = 0.003. The results are
summarized in Figure 2-3, which shows that the proposed method significantly outperforms all
other methods for the scale-free, small-world and random structures, and performs similarly to
gLasso and nodewise regression for the hub structure. To have a thorough comparison with the
existing methods, we also considered the scenario of n > p, with the results reported in Figure 2-4.
2.4 Real Data Examples
2.4.1 Liver Cytochrome P450s Subnetwork
Liver cytochrome P450s play critical roles in drug metabolism, toxicology, and metabolic
processes. They form a superfamily of monooxygenases critical for anabolic and catabolic
metabolism in all organisms characterized so far (Nelson et al., 1996; Aguiar et al., 2005;
Plant, 2007). Specifically, P450 enzymes are involved in the metabolism of various endogenous
and exogenous chemicals, including steroids, bile acids, fatty acids, eicosanoids, xenobiotics,
Figure 2-3. Precision-recall curves of each method for different types of structures with (n, p) = (100, 200). Upper left: hub; upper right: scale-free; lower left: small-world; lower right: random.
environmental pollutants, and carcinogens (De Montellano, 2005). Through experimental work,
Yang et al. (2010) determined the human liver transcriptional network structure and uncovered
subnetworks representative of the P450 gene regulatory network, as shown in the left panel of
Figure 2-5. The genes AK097548s, BC019583, ENST00000301162, and NM-173466
were excluded from our study, as they are non-protein-coding genes and their expression
data are not available in the original dataset. Following the proposed method, we first
applied the data-continuized transformation; we then adjusted for some effects that potentially
affect the distribution of the data, including age, gender, and batch of data collection, through
linear regression. Finally, we applied the nonparanormal transformation and the ψ-learning
method to the adjusted data. The right panel of Figure 2-5 shows the resulting subnetwork.
Figure 2-4. Precision-recall curves of each method for different types of structures with (n, p) = (500, 200). Upper left: hub; upper right: scale-free; lower left: small-world; lower right: random.
2.4.2 Acute Myeloid Leukemia mRNA Sequencing Network
This example illustrates the performance of the proposed method in the small-n-large-p
scenario. The dataset is the mRNA sequencing data from AML patients, available on the TCGA
data portal (http://cancergenome.nih.gov/). In this study, we worked directly on the raw count
data, which contain 179 patients and 19,990 genes. In preprocessing the data, we filtered out
low-expression genes: we first excluded the genes with at least one zero count, and then
selected the 500 genes with the largest inter-sample variance, as suggested by Gallopin et al. (2013).
Figure 2-5. Left: P450 gene regulatory subnetwork from Yang et al. (2010), where the known regulators and P450 genes are shown as blue rectangles and red ovals, respectively. Right: the subnetwork produced by the proposed method.
Figure 2-6. GRN produced by the proposed method for the AML RNA-seq data with(n, p) = (179, 500).
The selected genes are more likely linked to the development of AML as their expression levels
are highly variable.
Figure 2-6 shows the GRN produced by the proposed method for the AML RNA-seq
data. Through this network, we can identify some hub genes that are likely related to AML,
where a hub gene refers to a gene with strong connectivity to other genes. Our findings are
quite consistent with existing knowledge. For example, the hub gene MKI67 is a well-known
tumor proliferation marker; the prognostic value of MKI67 protein expression has
been reported for many types of malignant tumors, including brain, breast, and lung cancer,
with only a few exceptions for certain types of tumors (Mizuno et al., 2009). Another example
is the gene KLF6. Humbert et al. (2011) examined the expression patterns of KLFs with a
putative role in myeloid differentiation in a large cohort of primary AML patient samples,
CD34+ progenitor cells and granulocytes from healthy donors. They found that KLF2,
KLF3, KLF5 and KLF6 are expressed at significantly lower levels in AML blasts and CD34+
progenitor cells than in normal granulocytes, and that KLF6 is upregulated by RUNX1-ETO and
participates in RUNX1-ETO gene regulation. This finding provides new insights into the
under-studied mechanism of RUNX1-ETO target gene upregulation and identifies KLF6 as a
potentially important protein for further study in AML development (DeKelver et al., 2013).
The biological functions of other hub genes, such as H3F3B and TMC8, remain to be studied.
For comparison, gLasso, nodewise regression, and LPGM were applied to this
dataset, run in the same way as for the simulated examples: nodewise regression and gLasso
were run using the package huge under its default setting, with the regularization parameter
determined by the stability approach, and LPGM was run using the package XMRF under its
default setting. All of these methods produced much denser networks than the proposed method.
To assess the quality of the networks produced by the different methods, a power law curve (see,
e.g., Kolaczyk (2009), pp. 80-85) was fitted to each of them. A nonnegative random variable X is
said to have a power law distribution if

$$P(X = x) \propto x^{-\nu}, \tag{2-8}$$
for some positive constant ν. The power law states that the majority of vertices are of very low
degree, although some are of much higher degree. A network whose degree distribution follows
the power law is called a scale-free network, and many biological networks, for example gene
expression networks, protein-protein interaction networks, and metabolic networks, have been
verified to be scale-free (Barabasi & Albert, 1999). Figure 2-7 shows the log-log
plots of the degree distributions of the networks generated by the four methods, where the curves
are fitted by the loess function in R. It shows that the network produced by the proposed
method approximately follows the power law, while those produced by gLasso, nodewise
regression, and LPGM do not.
2.5 Discussion
We have proposed a method for learning GRNs from RNA-seq data. The proposed
method is a combination of a random effect model-based data-continuized transformation,
the nonparanormal transformation, and the ψ-learning algorithm. The proposed method
is consistent in the sense that the true gene regulatory network can be recovered from the
RNA-seq data when the sample size becomes large. The major contribution of the proposed
method lies in the data-continuized transformation, which fills the theoretical gap of how to
transform NGS data to continuous data and facilitates the learning of gene regulatory networks.
The proposed data-continuized transformation involves an adaptive Markov chain. We proved
the convergence and the weak law of large numbers for the adaptive Markov chain under
the framework provided by Liang et al. (2016). A strong law of large numbers (SLLN) can
potentially be proved for the algorithm under the framework provided by Fort et al. (2011).
With the SLLN, some stronger theoretical properties might be obtained for the resulting
networks.
In practice, some authors have treated the logarithm of RNA-seq data as continuous,
though this is not rigorous. The proposed method provides a justification for this practice, which is
necessary and important given the popularity of NGS techniques. As discussed in Liang et al.
(2015), the ψ-learning algorithm provides a general framework for how to integrate multiple
sources of data in reconstructing Gaussian graphical networks, where it is proposed to use a
meta-analysis method to combine the ψ-partial correlation coefficients calculated from different
sources of data. Similarly, with the proposed method, we can integrate different types of omics
data, such as RNA-seq and microarray data, to improve inference for gene regulatory networks.
We expect that this method will be widely used in the near future.

Figure 2-7. Log-log plots of the degree distributions of the four networks generated by the proposed method (upper left), gLasso (upper right), nodewise regression (lower left), and LPGM (lower right).
Finally, we note that, as an alternative to the LPGM method, an existing method that can
potentially be used for Poisson graphical modeling is the latent copula Gaussian graphical
modeling method (Hoff, 2007; Dobra et al., 2011). The basic idea of this method is to
introduce Gaussian latent variables in place of the discrete random variables in the Poisson
network inference. Since the method involves imputation of a large number of latent variables,
it is very slow and can only be applied to problems with a small set of genes.
CHAPTER 3
BAYESIAN NETWORKS FOR MIXED DATA
3.1 Introduction
We propose a new method for learning high-dimensional Bayesian networks. The
proposed method belongs to the category of constraint-based methods and can be viewed
as an extension of the ψ-learning method to Bayesian networks, but with special care
for v-structures. The proposed method consists of three stages, namely moral graph
learning, v-structure identification, and derived-direction identification: it first learns
the moral graph of the Bayesian network using the ψ-learning algorithm, then identifies
the v-structures contained in the network based on conditional independence tests, and
finally identifies the derived directions of the non-convergent edges according to logical rules.
The moral graph, which is formally defined in Section 3.2, can be viewed as a Markov network
representation of the Bayesian network. The consistency of the three-stage method is justified
under the small-n-large-p scenario. To illustrate the generality of the three-stage method, it is
applied to a variety of examples with mixed data, i.e., data consisting of both discrete and
continuous variables. The numerical results indicate that the proposed method significantly
outperforms the existing methods, including the PC algorithm. Under the sparsity assumption,
the proposed method has a computational complexity of O(p² 2^(m-1)), while the computational
complexity of the PC algorithm is O(p^(2+m)), where m is the maximum size of the Markov
blanket of a node.
The mixed data here are restricted to those consisting of Gaussian and multinomial/binomial
variables only. In this scenario, the joint distribution of the mixed variables is well defined
(see Lee & Hastie, 2013): the conditional distribution of each continuous variable given the
rest is still Gaussian, and the conditional distribution of each discrete variable given the rest is
still multinomial. Therefore, all conditional independence tests involved in the proposed method
can be conducted under the framework of generalized linear models (GLMs). Extension of the
proposed method to other types of mixed data is discussed in Section 3.6.
3.2 A Brief Review of Bayesian Network Theory
In this section, we give a brief review of the Bayesian network theory required in this
chapter. For a full account of the theory, refer to Nielsen & Jensen (2009) and Scutari &
Denis (2014).
As mentioned in Chapter 1, a Bayesian network can be represented by a directed acyclic
graph (DAG) G = (V,E), where V, with a slight abuse of notation, denotes a set of p nodes
corresponding to the p variables X1, ..., Xp, and E = (eij) denotes the adjacency matrix, or
arc set. The joint distribution of X1, ..., Xp is given by

$$P(X) = \prod_{i=1}^{p} q\big(X_i \mid \mathrm{Pa}(X_i)\big), \tag{3-1}$$

where Pa(Xi) denotes the parent nodes/variables of Xi in the network, and q(· | ·) specifies the
conditional distribution of Xi given its parent nodes. In a Bayesian network, each node Xi is
conditionally independent of its non-descendants (i.e., the nodes that cannot be reached from
Xi along a directed path) given its parents. This is the so-called local Markov property of
Bayesian networks.
The local Markov property implies that the parents are not completely independent of their
children in the Bayesian network: with Bayes' theorem, it is easy to show how information on
a child can change the distribution of a parent. A convergent connection Xi → Xk ← Xj is called
a v-structure if there is no arc connecting Xi and Xj. In addition, Xk is often called a collider
node, and the convergent connection is then called an unshielded collider. The v-structure
enables Bayesian networks to represent a type of relationship that Markov networks cannot,
namely that Xi and Xj are marginally independent while they are dependent conditioned on Xk.

The Markov blanket of a node Xi is the set consisting of the parents of Xi, the children
of Xi, and the spouse nodes that share a child with Xi. The Markov blanket of a node Xi ∈ V
is the minimal subset of V such that Xi is independent of all other nodes conditioned on it.
The Markov blanket is symmetric, i.e., if node Xi is in the Markov blanket of Xj, then Xj is
also in the Markov blanket of Xi.
If the directions of all arcs in a Bayesian network are removed, the resulting undirected
graph is called the skeleton of the Bayesian network. Note that Bayesian networks
with different arc sets can encode the same conditional independence relationships and
represent the same joint distributions. To illustrate this issue, consider the identity

P(Xi) P(Xj | Xi) P(Xk | Xj) = P(Xi | Xj) P(Xj) P(Xk | Xj),

where the left-hand side represents the serial connection Xi → Xj → Xk, and the right-hand
side represents the divergent connection Xi ← Xj → Xk. Two such Bayesian networks are said
to belong to the same equivalence class. Two DAGs defined over the same set of variables are
equivalent if and only if they have the same skeleton and the same v-structures. Hence, in
Bayesian networks, only the directions of the arcs that are part of one or more v-structures
are important.

The moral graph is an undirected graph constructed by (i) connecting the non-adjacent
nodes in each v-structure with an undirected arc, and (ii) ignoring the directions of the other
arcs. This transformation is called moralisation, and it provides a simple way to transform
a Bayesian network into the corresponding Markov network. In the Markov network, all
dependencies are explicitly represented, even those that would be represented implicitly by
v-structures in the Bayesian network. In the moral graph, the neighboring set of each node
forms its Markov blanket.
Finally, we give the definition of faithfulness of graphical models. Let M denote
the dependence structure of the probability distribution of X, i.e., the set of conditional
independence relationships between any triplet A, B, C of subsets of X. The graph G is said to
be faithful, or isomorphic, to M if for all disjoint subsets A, B, C of X we have

A ⊥P B | C ⟺ A ⊥G B | C, (3-2)

where the left-hand side denotes conditional independence in probability, and the right-hand
side denotes separation in the graph (i.e., C is a separator of A and B). For a Markov network,
C is said to be a separator of A and B if for every a ∈ A and b ∈ B, all paths from a to b have
at least one node in C. For a Bayesian network, C is said to be a separator of A and B if along
every path between a node in A and a node in B there is a node v satisfying one of the following
conditions: (i) v has convergent arcs and neither v nor any of its descendants is in C, or (ii)
v is in C and does not have convergent arcs. Faithfulness provides a theoretical basis for
establishing the consistency of constraint-based methods.
3.3 Learning High-Dimensional Bayesian Networks
Based on the theory of Bayesian networks, we propose a three-stage method to learn the
structure of high-dimensional Bayesian networks: (i) learning the moral graph, (ii) identifying
v-structures, and (iii) identifying derived directions. Upon completion, the first two stages
result in a partially directed acyclic graph (PDAG), which falls into the equivalence class of the
final Bayesian network. The third stage identifies the derived directions of the non-convergent
edges based on some logical rules; the direction of an edge is said to be derived when it is a
logical consequence of previously determined directions.
3.3.1 Learning the Moral Graph
Under the assumption of faithfulness, the moral graph can be learned via the conditional
independence tests Xi ⊥P Xj | Sij \ {Xi, Xj} for all ordered pairs (i, j), where Sij denotes
the Markov blanket of Xi or Xj. If the conditional independence holds, then there is no arc
between Xi and Xj; otherwise, Xi and Xj are in each other's Markov blanket.

In the literature, quite a few algorithms have been proposed for learning Markov
blankets, e.g., the grow-shrink Markov blanket algorithm (Margaritis, 2003) and the incremental
association algorithm (Tsamardinos et al., 2003b; Yaramakala & Margaritis, 2005). The
grow-shrink Markov blanket algorithm works like a forward selection procedure: it first
continues to add new variables to the conditioning set (starting with an empty set) until the
conditional independence holds or there are no more variables to add, and then shrinks the
conditioning set by removing the variables outside the blanket. The incremental association
algorithm is an enhancement of the grow-shrink Markov blanket algorithm, which reduces the
number of conditional tests by arranging the order in which variables are added to the
conditioning set.
A fundamental problem with these algorithms is that they often need to perform
conditional tests with the size of the conditioning set close to p. When p is greater than n,
such tests cannot be carried out or are very unreliable. Their computational complexity is
O(p^(2+a)) for some 0 < a ≤ 1, where the factor p^a accounts for the number of conditional
independence tests performed for each of the p² pairs of nodes. In the worst case, where the
graph is fully connected, a is equal to 1 for all of these algorithms.
In what follows, we present a new algorithm for learning moral graphs, which can work
under the scenario n ≪ p, and which has a computational complexity of O(p²) even in the worst
case. Instead of identifying the exact Markov blanket of each node, we propose to identify
a super Markov blanket S̃i for each node Xi such that Si ⊆ S̃i holds, where Si denotes the
Markov blanket of the node Xi. Let ϕij denote the output of the conditional independence test
Xi ⊥P Xj | Si \ {Xi, Xj}, i.e., ϕij = 1 if the conditional independence holds and ϕij = 0 otherwise.
Let ϕ̃ij denote the output of the conditional independence test Xi ⊥P Xj | S̃i \ {Xi, Xj}. Theorem
3.1 shows that, under the faithfulness assumption, ϕij and ϕ̃ij are equivalent in learning moral
graphs.

Theorem 3.1. Assume that faithfulness holds. Let Si denote the Markov blanket of Xi, and let
S̃i denote a superset of Si. Then ϕij and ϕ̃ij are equivalent in learning moral graphs in the sense
that ϕij = 1 ⟺ ϕ̃ij = 1.

Proof. If ϕij = 1, then Si \ {Xi, Xj} forms a separator of Xi and Xj. Since Si ⊆ S̃i, S̃i \ {Xi, Xj}
is also a separator of Xi and Xj. By faithfulness, we have ϕ̃ij = 1.

On the other hand, if ϕ̃ij = 1, then Xi and Xj are conditionally independent and S̃i \ {Xi, Xj}
forms a separator of Xi and Xj. Since S̃i ⊆ V, V \ {Xi, Xj} is also a separator of Xi and Xj,
and the conditional independence Xi ⊥P Xj | V \ {Xi, Xj} holds. By the total conditioning
property (Property 7 in Pellet & Elisseeff (2008)), which shows that Xj ∈ Si if and only if Xi
and Xj are conditionally dependent given V \ {Xi, Xj}, we have Xj ∉ Si. Since Si is the
Markov blanket of Xi, it follows that Xi ⊥P Xj | Si \ {Xi, Xj}. Therefore, ϕij = 1 holds.
By the symmetry of Xi and Xj, Theorem 3.1 also holds if Si is replaced by Sj and S̃i
is replaced by S̃j. Although ϕij and ϕ̃ij are equivalent in learning moral graphs, the size
of the super Markov blanket S̃i should be kept as small as possible in view of the power of
the conditional independence tests: a large S̃i often reduces the power of the conditional
independence test.
Based on Theorem 3.1, we propose the so-called p-screening algorithm for learning moral
graphs, which provides an efficient way to learn the Markov blankets of all nodes simultaneously.
The algorithm consists of the following steps:

Algorithm 3.1. p-screening algorithm

(a) (Screening for parent and child nodes) Find a superset of parents and children for each node Xi:

(i) For each unordered pair of nodes (Xi, Xj), i, j = 1, 2, ..., p, conduct the marginal
independence test Xi ⊥P Xj and obtain the p-value.

(ii) Conduct a multiple hypothesis test to identify the pairs of nodes that are dependent.
Denote the resulting superset by Ai for i = 1, ..., p. If the size of Ai is greater
than n/(c_{n1} log(n)) for a pre-specified constant c_{n1}, reduce it to n/(c_{n1} log(n)) by
removing the variables having large p-values in the marginal independence tests.

(b) (Spouse node amendment) For each node Xi, find the spouse nodes that are not
included in Ai, i.e., the set Bi = {Xj : Xj ∉ Ai, ∃ Xk ∈ Ai ∩ Aj} for
i = 1, ..., p, where Xj is a node not connected to Xi but sharing a common neighbor with Xi.
If the size of Bi is greater than n/(c_{n2} log(n)) for a pre-specified constant c_{n2}, reduce
it to n/(c_{n2} log(n)) by removing the variables having large p-values in the spouse tests
Xi ⊥P Xj | Xk.

(c) (Screening for the moral graph) Construct the moral graph based on conditional independence tests:

(i) For each ordered pair of nodes (Xi, Xj), i, j = 1, 2, ..., p, conduct the conditional
independence test Xi ⊥P Xj | Sij \ {i, j}, where Sij = Ai ∪ Bi if |Ai ∪ Bi \ {i, j}| ≤ |Aj ∪ Bj \ {i, j}|,
and Sij = Aj ∪ Bj otherwise.

(ii) Conduct a multiple hypothesis test to identify the pairs of nodes that are
conditionally dependent, and set the adjacency matrix Emb accordingly, where Emb
denotes the adjacency matrix of the moral graph.
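To illustrate the flavor of the algorithm, the following R code sketches simplified versions of steps (a) and (c) for the all-Gaussian case. It uses marginal correlation tests, Benjamini-Hochberg adjustment in place of the empirical Bayes multiple test, a linear-model t-test as the conditional independence test, and conditional tests restricted to the marginally dependent pairs; all of these are simplifying assumptions, as are the function and variable names.

```r
## A minimal sketch of moral-graph screening (steps (a) and (c)), Gaussian case.
moral.screen <- function(X, alpha1 = 0.2, alpha2 = 0.05) {
  p <- ncol(X)
  pair <- t(combn(p, 2))
  ## step (a): marginal screening to build the candidate sets A_i
  pv <- apply(pair, 1, function(e) cor.test(X[, e[1]], X[, e[2]])$p.value)
  dep <- pair[p.adjust(pv, "BH") < alpha1, , drop = FALSE]
  A <- lapply(1:p, function(i)
    unique(c(dep[dep[, 1] == i, 2], dep[dep[, 2] == i, 1])))
  ## (size reduction of A_i and the spouse amendment of step (b) are omitted)
  ## step (c): conditional screening given the smaller candidate blanket
  E <- matrix(0, p, p)
  for (k in seq_len(nrow(dep))) {
    i <- dep[k, 1]; j <- dep[k, 2]
    S <- setdiff(if (length(A[[i]]) <= length(A[[j]])) A[[i]] else A[[j]], c(i, j))
    fit <- summary(lm(X[, i] ~ X[, c(j, S)]))
    if (coef(fit)[2, 4] < alpha2) E[i, j] <- E[j, i] <- 1   # retain the edge
  }
  E
}
```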
As annotated in Algorithm 3.1, step (a) finds a superset of parents and children
for each node. As pointed out in the Appendix, Ai also contains the spouse nodes that are
marginally dependent on Xi. Step (b) finds the spouse nodes that are not included in
the superset Ai, i.e., the nodes that are marginally independent of Xi but dependent on Xi
conditioned on their common child. Then, for each node Xi, we have Si ⊆ Ai ∪ Bi; hence, we
can set S̃i = Ai ∪ Bi. It follows from Theorem 3.1 that this algorithm is valid for learning the
moral graph.
The multiple hypothesis tests involved in the algorithm can be done using the empirical
Bayes method developed in Liang & Zhang (2008). The advantage of this method is that
it allows for general dependence between test statistics. Other multiple hypothesis testing
methods that account for the dependence between test statistics, e.g., Benjamini et al. (2006),
can also be applied here. The performance of the multiple hypothesis tests depends on their
significance levels. Following Theorem 3.1, a slightly larger value of α1 should be used to reduce
the risk that Si ⊄ Ai ∪ Bi. On the other hand, the power of the conditional independence tests
in step (c) is adversely affected by the size of the superset S̃i and thus by the value of α1.
However, we found that this effect is not very sensitive to the size of S̃i; including a few extra
variables in S̃i will not hurt the power of the moral graph screening tests much. To balance
the two considerations, we suggest setting α1 = 0.1 or 0.2. Throughout the examples of this
chapter, we set α1 = 0.2 and α2 = 0.05, unless otherwise stated.
In the algorithm, we have restricted the sizes of Ai and Bi based on the sparsity
assumption for the high-dimensional Bayesian network, given by condition (C) of Section 3.3.4.
By assuming that each conditional distribution q(·) in (3-1) can be represented
by the probability distribution function of a normal linear regression or a multiclass logistic
regression, we are able to bound the size of each set Ai by O(n/log(n)) based on the theory
of sure independence screening (Fan & Lv, 2008; Fan et al., 2010); refer to Appendix
B for the details of the theoretical development. Further, under the sparsity assumption, we
are also able to bound the size of each set Bi by O(n/log(n)). Therefore, the size of each
superset S̃i = Ai ∪ Bi can be bounded by O(n/log(n)). With appropriate choices of c_{n1} and
c_{n2}, we can always have |S̃i| < n for all i = 1, 2, ..., p when n is reasonably large. In
this paper, we set c_{n1} = c_{n2} = 1 for all examples. In practice, when the sample size n is small,
even if the size of Bi is smaller than the pre-specified threshold, we might still conduct spouse
tests to reduce its size further: since the size of S̃i adversely affects the power of the moral
graph screening test, a smaller Bi is always preferred.
Since both the marginal tests in step (a) and the conditional independence tests in step
(c) need to be performed only once for each ordered pair of nodes, and the multiple hypothesis
tests can be done in time linear in the total number of p-values, the computational
complexity of the p-screening algorithm is O(p²), independently of the underlying
structure of the Bayesian network, while in the worst case the computational complexity of the
existing algorithms is O(p³).
3.3.2 Identifying v -structures
Given the moral graph, the v-structures contained in the Bayesian network can be
identified by performing further conditional independence tests around each variable. With
the identified v-structures, the Markov blankets can be resolved by deleting the spouse links
and orienting the arcs in the v-structures. This step can be accomplished using
some existing algorithms, such as the collider set algorithm (Pellet & Elisseeff, 2008) or the
local neighborhood algorithm (Margaritis & Thrun, 1999). In this paper, we adopted the
collider set algorithm and provide a theoretical justification for the consistency of the algorithm
under the small-n-large-p scenario; refer to Theorem B.2 in Appendix B for details.
According to the theory of Bayesian networks, only the arcs in v-structures can be
oriented. Given that the moral graph is correct, only triangles can hide spouse links and
v-structures. Three nodes Xi, Xj and Xk form a triangle if the edges Xi − Xj, Xi − Xk
and Xj − Xk all exist. Let Tri(Xi, Xj) = {Xk ∈ V : (Xi, Xk) ∈ Emb, (Xj, Xk) ∈ Emb},
with Xi, Xj ∈ V and (Xi, Xj) ∈ Emb, denote the set of nodes that form a triangle with
Xi and Xj in the moral graph; Tri(Xi, Xj) is also the intersection of the Markov
blankets of Xi and Xj. Note that two spouses Xi and Xj that are not linked in the true graph
can be separated by some set of nodes. Thus, if we can find a set Sij that makes Xi and
Xj conditionally independent, then the link between them is a spouse link to be removed,
and any node Xk ∈ Tri(Xi, Xj) \ Sij is a collider, i.e., a common child, so that the
triplet (Xi, Xj, Xk) forms a v-structure Xi → Xk ← Xj. Let MB(·) denote the Markov blanket
information for each node Xi ∈ V, and let BD(Xi) denote the boundary of Xi, which is the set of
direct neighbors of Xi in the graph G.
We initialize the PDAG as the moral graph given by MB(·), and D as an empty list of
orientation directives. The collider set algorithm of Pellet & Elisseeff (2008) then proceeds as
follows.

Algorithm 3.2. (Collider set algorithm)

(a) For each edge Xi − Xj that is part of a fully connected triangle:

(i) Set B to the smaller of the two sets BD(Xi) \ Tri(Xi, Xj) \ {Xj} and
BD(Xj) \ Tri(Xi, Xj) \ {Xi}.

(ii) For each S ⊆ Tri(Xi, Xj), set Xk = B ∪ S. If Xi and Xj are conditionally
independent given Xk, then set Sij = Xk and go to step (b). Otherwise, set R to
B ∩ {nodes reachable from W in V \ {i, j} : W ∈ Tri(Xi, Xj) \ S} and B′ to B \ R. For
each S′ ⊆ R, set Xk = B′ ∪ S′ ∪ S. If Xi and Xj are conditionally independent given Xk,
then set Sij = Xk and go to step (b).

(b) If Sij is not empty, mark the link Xi − Xj as a spouse link, and for each
Xk ∈ Tri(Xi, Xj) \ Sij, set D = D ∪ {(Xi → Xk ← Xj)}.

(c) Remove all spouse links from the graph G.

(d) For each orientation directive (Xi → Xk ← Xj) ∈ D, if the edges Xi − Xk and Xj − Xk still
exist in G, then orient them as Xi → Xk ← Xj.
Step (a.ii) is based on two caveats concerning the collider set search. First, there might be
d-connecting paths between Xi and Xj that do not pass through any node of Tri(Xi, Xj);
those paths must be appropriately blocked. Second, the base conditioning set must be
checked so that it does not include any descendants of possible colliders, since no descendant
of a collider can be included in a separator set of a Bayesian network.
The complexity of the whole algorithm, iterating over all triangle links and measured by the
number of conditional independence tests, is O(pm 2^(m-1)), where m is the maximum size of the
Markov blanket of a node: the factor pm represents the total number of pairs of candidate spouse
nodes, and the factor 2^(m-1) represents the maximum number of conditioning subsets that
need to be considered for each pair of candidate spouse nodes. When the network is sparse, the
algorithm performs reasonably fast. The local neighborhood algorithm (Margaritis & Thrun,
1999) has the same computational complexity. Pellet & Elisseeff (2008) pointed out that the
collider set algorithm has two major benefits. One is related to the triangle search: given that
the Markov blanket information is correct, only triangles can hide spouse links and v-structure
information. The other is that, for each connected pair Xi − Xj in a triangle, decisions
about spouse links and edge orientations are made at the same time, which makes the
algorithm faster.
3.3.3 Identifying Derived Directions
Upon completion of the second stage of the proposed method, the skeleton and colliders of the
Bayesian network have been identified; that is, we have a PDAG in the equivalence class of the
Bayesian network. Given the skeleton and colliders, a maximally directed Bayesian network can
be obtained by following four necessary and sufficient rules (see, e.g., Verma & Pearl (1991),
Meek (1995) and Kjaerulff & Madsen (2008)), which ensure that no cycles and no additional
colliders are created in the graph:

(A) Since Xi → Xj − Xk is not a valid v-structure, the edge must be directed as Xj → Xk.

(B) Given the edges Xi → Xj → Xk, directing the remaining edge from Xk to Xi would
produce a directed cycle, so it must be directed as Xi → Xk.

(C) Since directing the edge as Xj → Xi would inevitably produce an additional collider
Xl → Xi ← Xk or a directed cycle, it must be directed as Xi → Xj.

(D) Since directing the edge as Xj → Xi would inevitably produce an additional collider
Xj → Xi ← Xl or a directed cycle, it must be directed as Xi → Xj.
These four rules can be applied repeatedly until no more edges can be directed; by the repeated
application, all edges common to the equivalence class of the Bayesian network can be
identified. The remaining edges may be directed using expert knowledge. Alternatively, an
optimization procedure, such as simulated annealing (Kirkpatrick et al., 1983) or stochastic
approximation annealing (Liang et al., 2014), might be applied to direct the remaining edges
such that a selected score function is optimized.

Figure 3-1. Four necessary and sufficient rules for derived directions, where the indices I, J, K and L are used to represent the nodes Xi, Xj, Xk and Xl, respectively.
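As an illustration, the following R code repeatedly applies rule (A) to a PDAG represented by a directed adjacency matrix `dir` (dir[i, j] = 1 means Xi → Xj) and a symmetric undirected adjacency matrix `und`; the representation and function name are illustrative choices.

```r
## A minimal sketch of rule (A): whenever Xi -> Xj and Xj - Xk with Xi and Xk
## non-adjacent, orient the undirected edge as Xj -> Xk; repeat to a fixed point.
apply.rule.A <- function(dir, und) {
  p <- nrow(dir)
  repeat {
    changed <- FALSE
    for (j in 1:p) for (k in 1:p) {
      if (und[j, k] == 1) {
        for (i in which(dir[, j] == 1)) {       # parents Xi of Xj
          if (dir[i, k] + dir[k, i] + und[i, k] == 0) {  # Xi, Xk non-adjacent
            dir[j, k] <- 1; und[j, k] <- und[k, j] <- 0  # orient Xj -> Xk
            changed <- TRUE; break
          }
        }
      }
    }
    if (!changed) break
  }
  list(dir = dir, und = und)
}
```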
3.3.4 Consistency of the Proposed Method
This subsection establishes the consistency of the proposed method; that is, the proposed
method is able to identify a PDAG in the equivalence class of the true Bayesian network as the
sample size n becomes large. To achieve this goal, we assume that the joint distribution (3-1)
of the Bayesian network can be re-expressed in the form

$$p(x, y \mid \Theta) \propto \exp\Big\{-\frac{1}{2}\sum_{s=1}^{p_c}\sum_{t=1}^{p_c}\theta_{st}\,x_s x_t + \sum_{s=1}^{p_c}\nu_s x_s + \sum_{s=1}^{p_c}\sum_{j=1}^{p_d}\rho_{sj}(y_j)\,x_s + \sum_{j=1}^{p_d}\sum_{r=1}^{p_d}\psi_{rj}(y_r, y_j)\Big\}, \tag{3-3}$$
where xs denotes the sth of the pc continuous variables, and yj denotes the jth of the pd discrete
variables. The joint model is parametrized by Θ = [{θst}, {νs}, {ρsj}, {ψrj}]. As shown in Lee
& Hastie (2013), the conditional distributions of (3-3) are given by Gaussian linear regressions and
multiclass logistic regressions. Therefore, all the conditional independence tests conducted in
the moral graph learning and v-structure identification stages are well defined; they are
equivalent to testing whether the corresponding regression coefficients are zero. To be
specific, the test in step (a) of the p-screening algorithm is equivalent to testing the coefficient of
Xj in the GLM
$$X_i \sim 1 + X_j, \tag{3-4}$$

the test in step (c) of the p-screening algorithm is equivalent to testing the coefficient of Xj in the
GLM

$$X_i \sim 1 + X_j + \sum_{k \in S_{ij}} X_k, \tag{3-5}$$

and the test in the v-structure identification stage is equivalent to testing the coefficient of Xj in
the GLM

$$X_i \sim 1 + X_j + \sum_{k \in D_{ij}} X_k, \tag{3-6}$$

where Dij denotes a subset of BD(Xi) \ {Xj}.
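A minimal R sketch of such a test for (3-5) is given below; it covers only the Gaussian and binary cases (a multiclass response would require a multinomial logistic fit), and it assumes the conditioning variables are supplied as a non-empty data frame with `xj` coded numerically. The function name is illustrative.

```r
## A minimal sketch of the GLM-based conditional independence test (3-5):
## Gaussian response -> linear regression, binary response -> logistic regression;
## the p-value of X_j's coefficient serves as the test's p-value.
ci.test.glm <- function(xi, xj, S) {       # S: data frame of conditioning vars
  dat <- data.frame(xi = xi, xj = xj, S)
  fam <- if (all(xi %in% 0:1)) binomial() else gaussian()
  fit <- glm(xi ~ ., family = fam, data = dat)
  summary(fit)$coefficients["xj", 4]       # p-value for the coefficient of X_j
}
```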
Under the GLM assumption, the consistency of the proposed method can be proved based on
the theory of sure independence screening established in Fan et al. (2010), the theory of the
ψ-learning algorithm established in Liang et al. (2015), and the theory established in Kalisch
& Buhlmann (2007) for the PC algorithm. Parallel to the conditions assumed by the PC
algorithm for the Gaussian case, we assume the following conditions:

(A) (Faithfulness) The Bayesian network is faithful, and its joint distribution can be
expressed as a Gaussian-multinomial distribution of the form (3-3).

(B) (High dimensionality) The dimension pn = O(exp(n^δ)), where 0 ≤ δ < (1 − 2κ)α/(α + 2)
for some positive constants κ < 1/2 and α > 0, and the subscript n of pn indicates the
dependence of the dimension p on the sample size n.

(C) (Sparsity) The maximum size of the Markov blankets, denoted by qn = max_{1≤i≤pn} |Si|,
satisfies qn = O(n^b) for some constant 0 ≤ b < (1 − 2κ)α/(α + 2), where Si denotes the
Markov blanket of node i.

(D) (Identifiability) The regression coefficients satisfy

$$\inf\big\{|\beta_{ij|C}| : \beta_{ij|C} \neq 0,\ i, j = 1, 2, \dots, p_n,\ C \subseteq \{1, 2, \dots, p_n\}\setminus\{i, j\},\ |C| \leq O(n/\log(n))\big\} \geq c_0 n^{-\kappa}. \tag{3-7}$$
Since the moral graph learning stage works based on the theory of sure independence
screening, we follow Fan et al. (2010) to give some conditions for GLMs (see Appendix B for
details) such that the resulting Bayesian network satisfies the sparsity condition (C). Fan
et al. (2010) showed that variable screening can be done based on regression coefficients or on
the p-values of the conditional independence tests, which are equivalent to each other; for this
reason, the identifiability condition (D) is given in terms of regression coefficients. Under these
conditions, we show in the Appendix that the proposed method is consistent, i.e.,
P(Ê(n)_mb = E(n)_mb) → 1 and P(Êv = Ev | Ê(n)_mb = E(n)_mb) → 1 as n → ∞, where E(n)_mb denotes the
adjacency matrix of the moral graph, Ev denotes the set of v-structures, and Ê(n)_mb and Êv
denote the estimators of E(n)_mb and Ev obtained by the proposed method, respectively.
Figure 3-2. A smaller version of the graph structure underlying the simulation study, where the circle nodes represent Gaussian variables, the square nodes represent Bernoulli variables, and the solid, dotted and dashed lines represent three different types of edges.
Table 3-1. Outcomes of a binary decision.

                      Actual positive (P)    Actual negative (N)
Predicted positive    True positive (TP)     False positive (FP)
Predicted negative    False negative (FN)    True negative (TN)
3.4 Simulation Studies
3.4.1 Mixed Data for an Undirected Graph
This example was modified from Lee & Hastie (2015); it consists of two types of
variables, Bernoulli and Gaussian. Figure 3-2 shows a smaller version of the structure of the
underlying graph. The data were simulated under two settings, (n, pc, pd) = (500, 100, 100)
and (100, 100, 100), where n denotes the sample size, pc denotes the number of Gaussian
variables, and pd denotes the number of Bernoulli variables. Under each setting, 10 datasets
were simulated independently.

The proposed three-stage method was first applied to this example. Since the true graph
is undirected, only the skeleton was output; that is, the directions of the edges of the
resulting Bayesian network were ignored. Figure 3-3 was drawn by varying the significance level
α2 while fixing the significance levels α1 = 0.02 and α3 = 0.05.
Lee & Hastie (2015) proposed to recover the underlying graph structure by maximizing a
penalized pseudo-likelihood function. The pseudo-likelihood function is defined as the
product of the conditional likelihood functions as in Besag (1974), and the penalty terms
used there vary, for scalars, vectors and matrices, from the l1 norm to the l2 norm to the
Frobenius norm. For simplicity, we call their method the pseudo-likelihood method. For
comparison, the pseudo-likelihood method was applied to this example; the resulting
precision-recall curves for the two datasets are also shown in Figure 3-3, where the curves
were drawn by varying the regularization parameter from 0 to 8. The comparison indicates
that the three-stage method significantly outperforms the pseudo-likelihood method for this
example.

Figure 3-3. Precision-recall curves produced by the three-stage and pseudo-likelihood methods for two mixed datasets: the left generated under the setting (n, pc, pd) = (500, 100, 100) and the right under the setting (n, pc, pd) = (100, 100, 100).

Table 3-2. Average areas under the precision-recall curves produced by the three-stage and pseudo-likelihood methods. The number in parentheses represents the standard deviation of the areas averaged over 10 datasets.

(n, pc, pd)        Three-stage method     Pseudo-likelihood
(500, 100, 100)    0.9970 (8.22×10^-4)    0.9428 (9.49×10^-7)
(100, 100, 100)    0.7493 (0.012)         0.4725 (1.58×10^-4)
Table 3-2 summarizes the areas under the precision-recall curves produced by the two
methods for the 20 simulated datasets. Both methods work stably across datasets,
as indicated by the small standard deviations. The comparison shows that the three-stage
method significantly outperforms the pseudo-likelihood method, especially under the
small-n-large-p scenario (i.e., n < pc + pd).
3.4.2 Mixed Data for a Directed Graph
This example illustrates the performance of the three-stage method for learning Bayesian
networks with mixed data, along with comparisons with a variety of existing methods.
Following Kalisch & Buhlmann (2007), we simulated the mixed data using the following procedure:
(i) fix an order of the variables; (ii) randomly mark half of the variables as continuous and the
rest as binary; (iii) fill the adjacency matrix E with zeros, and replace the lower triangle (below
the diagonal) of E with independent realizations of Bernoulli random variables generated with
a success probability s; and (iv) generate the data according to the adjacency matrix in a
sequential manner.
For this example, this variable X1, which corresponds to the first node of the Bayesian
network, was generated through a Gaussian random variable Y1 ∼ N(0, 1). We set X1 = Y1
if X1 was set to be continuous, and X1 ∼ Binomial(n, 1/(1 + e−Y1)) otherwise. The other
variables X ′i s, i = 2, 3, ..., p, were then sequentially generated by setting
Yi =
i−1∑k=1
0.5EikXK + ϵi , (3–8)
Xi =
Yi , if Xi is continuous,
Binomial(n, exp(Yi )1+exp(Yi )
), if Xi is binary,
(3–9)
where ϵ1, ..., ϵp are iid standard Gaussian random variables, and Eik denotes the (i, k) entry of E. The success probability s used in step (iii) controls the sparsity of the Bayesian network. In our simulations, we set s = 0.02. Let pc and pd denote the numbers of continuous and discrete variables, respectively. In our simulations, we fixed pc = pd = 50, while varying the sample size n at four levels: n = 100, 500, 1000 and 3000. For each value of n, ten datasets were simulated independently.
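The following R sketch illustrates steps (i)-(iv) and Equations (3–8)-(3–9) for one dataset. It is a sketch under our own reading of the notation, in which each binary observation is drawn as a single Bernoulli trial with success probability exp(Yi)/(1 + exp(Yi)); p is assumed even so that exactly half of the variables are continuous.

    set.seed(1)
    n <- 500; p <- 100; s <- 0.02
    is_cont <- sample(rep(c(TRUE, FALSE), p / 2))    # step (ii): variable types
    E <- matrix(0, p, p)                             # step (iii): sparse lower triangle
    E[lower.tri(E)] <- rbinom(p * (p - 1) / 2, 1, s)
    X <- matrix(0, n, p)                             # step (iv): sequential generation
    Y1 <- rnorm(n)
    X[, 1] <- if (is_cont[1]) Y1 else rbinom(n, 1, plogis(Y1))
    for (i in 2:p) {
      Yi <- X[, 1:(i - 1), drop = FALSE] %*% (0.5 * E[i, 1:(i - 1)]) + rnorm(n)  # (3-8)
      X[, i] <- if (is_cont[i]) Yi else rbinom(n, 1, plogis(Yi))                 # (3-9)
    }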
The three-stage method was first applied to this example with the default setting α1 = 0.2, α2 = 0.05 and α3 = 0.05. Figure 3-5 shows the Bayesian network obtained by the three-stage method for a dataset of size n = 3000, where an edge with two directions means that the direction of the edge is undetermined. Compared with the true Bayesian network (shown in Figure 3-4), it is easy to see that many of the identified edges, including their directions, are correct. For example, in the true Bayesian network node 16 has four parents, 92, 71, 53 and 78, all of which were correctly identified by the three-stage method. Similarly, the local structure around node 52 was correctly recovered, and all the parent nodes 4, 56 and 99 of node 100 were correctly identified.
Table 3-3. Average precision and recall of the directed graphs produced by the three-stage (P-screening), PC, HC, MMHC and GS algorithms. The number in parentheses represents the standard deviation of the value averaged over 10 datasets.

(n, pc, pd)       P-screening        PC                 HC                 MMHC               GS
                  Prec.    Recall    Prec.    Recall    Prec.    Recall    Prec.    Recall    Prec.    Recall
(100, 50, 50)     0.562    0.198     0.276    0.056     0.184    0.268     0.458    0.185     0.243    0.018
                  (0.018)  (0.016)   (0.036)  (0.007)   (0.010)  (0.013)   (0.018)  (0.007)   (0.040)  (0.004)
(500, 50, 50)     0.665    0.737     0.607    0.346     0.499    0.482     0.652    0.389     0.270    0.016
                  (0.011)  (0.009)   (0.010)  (0.013)   (0.019)  (0.019)   (0.024)  (0.016)   (0.035)  (0.005)
(1000, 50, 50)    0.702    0.866     0.619    0.463     0.561    0.533     0.665    0.450     0.450    0.038
                  (0.008)  (0.010)   (0.012)  (0.006)   (0.020)  (0.022)   (0.024)  (0.022)   (0.060)  (0.006)
(3000, 50, 50)    0.740    0.990     0.725    0.627     0.568    0.561     0.689    0.539     0.385    0.049
                  (0.015)  (0.003)   (0.011)  (0.009)   (0.024)  (0.020)   (0.030)  (0.020)   (0.039)  (0.008)
Figure 3-4. The true directed network for a dataset with n = 3000 samples.
Table 3-3 summarizes the precision and recall values for the PDAGs produced by the three-stage method, where the PDAG refers to the network obtained at the second stage, for which only the skeleton and v-structures are identified. For comparison, a variety of existing methods, including PC (Spirtes et al., 2000; Kalisch & Buhlmann, 2007), hill-climbing (HC) (Bouckaert, 2001), max-min hill-climbing (MMHC) (Tsamardinos et al., 2006), and grow-shrink (GS) (Margaritis, 2003), were applied to this example. All these methods have been implemented in the R packages pcalg or bnlearn; a sketch of the corresponding calls is given below. Among these methods, PC and GS belong to the class of constraint-based methods, HC belongs to the class of score-based methods, and MMHC belongs to the class of hybrid methods.
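For reference, the sketch below shows how these methods can be invoked in R. It is a minimal sketch under assumed inputs (a data frame dat with continuous columns stored as numeric and binary columns stored as factors); the exact settings used for Table 3-3 may differ.

    library(bnlearn)   # HC, MMHC and GS
    library(pcalg)     # PC

    bn_hc   <- hc(dat)      # score-based search
    bn_mmhc <- mmhc(dat)    # hybrid search
    bn_gs   <- gs(dat)      # constraint-based search

    # PC via pcalg, here with Gaussian CI tests after coding all columns
    # numerically; this coding of the binary columns is our simplification.
    num  <- data.frame(lapply(dat, as.numeric))
    suff <- list(C = cor(num), n = nrow(num))
    pc_fit <- pc(suffStat = suff, indepTest = gaussCItest,
                 alpha = 0.05, labels = colnames(num))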
The comparison indicated that for this example, the three-stage method is superior to the
existing methods in both precision and recall. In particular, when the sample size is moderately
large, say n = 500 and 1000, the three-stage method produced both good precision and good recall values, whereas the existing methods can be much inferior under this scenario.

Figure 3-5. The estimated directed network for a dataset with n = 3000 samples.
3.5 Real Data Analysis
3.5.1 Lung Cancer Genetic Network
This study aims to learn an interactive genetic network for lung squamous cell carcinoma (LUSC), which incorporates both gene expression information (mRNA-array data) and mutation information. The dataset was downloaded from The Cancer Genome Atlas (TCGA) at http://tcga-data.nci.nih.gov/tcga/. The original mRNA data contains 17814 genes and 154 patients, and the original mutation data contains 14873 genes and 178 patients. We filtered the mRNA data by including only the 1000 genes with the most variable expression values, and filtered the mutation data by including only the 102 genes for which the mutation occurred in at least 15% of
the patients. We then merged the two datasets together. The merged dataset consists of 121 patients, which are common to both the mRNA and mutation data.

Figure 3-6. The Bayesian network produced by the three-stage method with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC (Lung Squamous Cell Carcinoma) patients.
Figure 3-6 shows the Bayesian network produced by the three-stage method with α1 = 0.1, α2 = 0.05 and α3 = 0.05. Since the sample size n = 121 is small relative to the total number of variables p = 1102, a small value of α1 is used. We expect that such a small value of α1 will result in a slightly smaller superset Si for each node. Further, to improve the power of the moral graph screening tests, the spouse tests were performed at a significance level of 0.05. The spouse tests reduced the size of the conditioning sets used in the moral graph screening tests. The three-stage method cost 1.94 CPU hours on a 3.6GHz desktop for
this example, and the resulting Bayesian network consists of 693 edges. As shown in Figure 3-6, the network contains four clusters, which are centered at the genes POU2AF1, KLK10, HOXC13 and NPR3, respectively. It is interesting to point out that all four hub genes are lung cancer related. For example, Zhou et al. (2016) reported that POU2AF1 functions in the human epithelium to regulate expression of host defense genes, and Faner et al. (2016) found that POU2AF1 is a B-cell recruitment and immunoglobulin transcription gene which is correlated with emphysema severity. KLK10 has been shown to be over-expressed in lung cancer; see, e.g., Cantile et al. (2012) and Carvalho et al. (2012). Besides the hub genes, several significant RNA and mutation interactions have been identified as well. For example, CDH12,
which is linked to the cluster of HOXC13, has recently been verified to play a significant role in the progression of lung cancer, and patients without CDH12 mutations have a higher survival rate than those with CDH12 mutations (Zhao et al., 2013). CDH18, which is linked to the cluster of NPR3, is among the newly identified mutated genes of lung cancer (Liu et al., 2012). BIRC6 is a potential prognostic biomarker in patients with non-small cell lung cancer (Gharabaghi, 2016). Additionally, abnormality of the TP53 gene is one of the most significant events in lung cancers and plays an important role in the tumorigenesis of lung epithelial cells (Mogi & Kuwano, 2011).
For comparison, the existing methods, including HC, MMHC, PC and GS, were also applied to this example. On the same computer as used by the three-stage method, MMHC took 0.79 CPU hours and HC took 10.89 CPU hours. GS and PC ran for more than 40 CPU hours without producing results. MMHC is a heuristic algorithm, which employs some heuristic rules to first learn a set of candidate parents for each node and then conducts a hill-climbing greedy search to find a sub-optimal network. Potentially, the method can be pretty fast due to its heuristic nature. The HC method searches for an optimally scored network over a large space of networks and hence can be pretty slow. We note that the three-stage method is currently implemented in R, and its computation time is expected to be much shortened if it were implemented in C or FORTRAN.
Figures 3-7 and 3-8 show the Bayesian networks produced by MMHC and HC, respectively. The former consists of 1295 edges and the latter consists of 11999 edges. The network produced by the HC method is too dense. The network produced by MMHC looks reasonably good, although the consistency of the method cannot be guaranteed. This is consistent with its performance in the simulation studies; see Table 3-3.

Figure 3-7. The Bayesian network produced by MMHC with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC (Lung Squamous Cell Carcinoma) patients.

Figure 3-8. The Bayesian network produced by HC with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC (Lung Squamous Cell Carcinoma) patients.
3.5.2 Glioblastoma Genetic Network with Methylation Adjustment
Covariate effect adjustment is important in learning gene regulatory networks, as the relationship between variables can be affected by external variables. The covariate effects can be easily adjusted under the framework provided by the p-learning method. Let W1, ..., Wq denote the external variables. To adjust for their effects, we can replace the p-values used in the parents-children screening step by the p-values calculated from the GLM

Xi ∼ 1 + Xj + W1 + W2 + ... + Wq, (3–10)

in testing the hypothesis H0 : γ′ij = 0 versus H1 : γ′ij ≠ 0, where γ′ij is the regression coefficient of Xj. Similarly, the p-values used in the moral graph screening step can be replaced by the
p-values calculated from the GLM

Xi ∼ 1 + Xj + ∑_{k∈Sij} Xk + W1 + W2 + ... + Wq, (3–11)

in testing the hypothesis H0 : β′ij = 0 versus H1 : β′ij ≠ 0, where Sij is as defined in Algorithm 3.1 and β′ij is the regression coefficient of Xj.
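As an illustration, the adjusted p-value in (3–10) can be extracted from a standard GLM fit in R. The sketch below uses a hypothetical helper adj_pvalue(), which is ours for illustration; fam would be gaussian() for a continuous Xi and binomial() for a binary Xi.

    # p-value attached to Xj when regressing Xi on Xj and the covariates;
    # 'W' is assumed to be a data frame holding the columns W1, ..., Wq.
    adj_pvalue <- function(xi, xj, W, fam = gaussian()) {
      fit <- glm(xi ~ xj + ., data = W, family = fam)
      summary(fit)$coefficients["xj", 4]   # fourth column holds the p-value
    }

For (3–11), the conditioning variables XSij are simply appended to W as extra columns before calling adj_pvalue().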
The dataset we considered is for glioblastoma (GBM), a highly malignant brain tumor for which no cure is available. The dataset consists of both methylation and gene expression data (mRNA-array data) and was downloaded from TCGA at https://tcga-data.nci.nih.gov/tcga/. We filtered the dataset by including only the 1000 genes with the most variable gene expression values. The number of samples/patients is 281. It is known that gene expression values can be affected by the methylation sites in the promoter region. For this reason, we annotated the methylation features of the 1000 genes according to their positions on the chromosome.
In the literature, a few works have performed integrative analyses of gene expression data and methylation data. For example, Wang et al. (2013) proposed an integrative Bayesian genomic (iBAG) model, where the direct effects of methylation on gene expression were first inferred and then combined with the gene expression data to predict clinical outcomes. For this example, we are interested in finding out how genes regulate each other after methylation effects are adjusted. Therefore, we treat the methylation features as covariates. Since the association between the variables (genes) and covariates (methylation features) is clear, only the covariates associated with Xi and Xj need to be included in (3–10). Similarly, only the covariates associated with Xi, Xj and XSij\{i,j} need to be included in (3–11). Figure 3-9 shows the Bayesian network produced by the three-stage method with α1 = 0.2, α2 = 0.01 and α3 = 0.01. There are 262 edges identified in total. By increasing the values of α2 and α3, denser networks can be obtained. The total CPU time is about 3 hours on a 3.6GHz desktop. Our result is very meaningful: many of the connections indicated by the network have been verified in the literature. For this example, we can identify several hub genes, e.g., UBE1L and SHC3. It is known that UBE1L can cause cancer growth suppression (Feng et al., 2008) and SHC3 affects survival in human high-grade astrocytomas (Magrassi et al., 2005).

Figure 3-9. Directed glioblastoma genetic network learned by the three-stage method with methylation effects having been adjusted.
3.6 Discussion
We have proposed a three-stage method for learning the structure of high-dimensional Bayesian networks. The three-stage method first learns the moral graph of the Bayesian network based on variable screening and multiple hypothesis tests, then resolves the local structure of the Markov blanket for each variable based on conditional independence tests, and finally identifies the derived directions for non-convergent connections based on logical rules. We justified the consistency of the three-stage method under the small-n-large-p scenario. The numerical results indicate that the three-stage method significantly outperforms the existing ones, such as the PC, grow-shrink, hill-climbing, and max-min hill-climbing methods.
The time complexity of the three-stage method is dominated by its first two stages. As analyzed in Section 3.2, the time complexity of the first stage is O(p^2) and that of the second stage is O(p m 2^{m−1}), where m = max_{i∈V} |Si| denotes the maximum Markov blanket size over all nodes. Hence the computational complexity of the three-stage method is O(max{p^2, p m 2^{m−1}}), which is bounded by O(p^2 2^{m−1}). This is much better than the PC algorithm, which has a computational complexity of O(p^{2+m}) under the same sparsity assumption. This analysis is consistent with the CPU times reported in Section 3.5.1.
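To make this gap concrete, consider the illustrative values p = 1000 and m = 10 (assumed here purely for the arithmetic, not a setting used in our experiments): the bound for the three-stage method is p^2 2^{m−1} = 10^6 × 512 ≈ 5×10^8 operations, whereas the corresponding bound for the PC algorithm is p^{2+m} = 1000^{12} = 10^{36}.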
In this paper, we employed logical rules to derive directions for non-convergent edges. Alternatively, we can employ a stochastic optimization method, e.g., simulated annealing (Kirkpatrick et al., 1983) or stochastic approximation annealing (Liang et al., 2014), to direct those edges by optimizing a selected score function. Liang et al. (2014) showed that the stochastic approximation annealing algorithm can converge to the global optimum in probability under a square-root cooling schedule, under which the temperature can decrease much faster than under the logarithmic cooling schedule required by simulated annealing. Further, motivated by hybrid methods, we can employ the stochastic optimization method to resolve the Markov blankets and identify the derived directions. This will be further studied in our future research.
Extension of the proposed method to some other types of mixed data is quite straightforward. For example, for non-Gaussian continuous variables, the nonparanormal transformation proposed by Jia et al. (2017) can be applied to Gaussianize the data prior to applying the three-stage method. For Poisson random variables, the random-effect model-based transformation proposed in Chapter 2 can first be applied to continuize the data, and the nonparanormal transformation can then be applied to Gaussianize the data. Negative binomial data can be treated in the same way. For some other types of discrete data, we might regroup and treat them as multinomial data.
Finally, we note that the moral graph produced in the first stage of the three-stage method is a Markov network. Learning Markov networks for mixed data is also of great interest in the current literature. For example, Cheng et al. (2013) proposed a conditional Gaussian distribution-based method and Fan et al. (2017) proposed a semiparametric latent variable method to tackle the problem. The conditional Gaussian distribution used in Cheng et al. (2013) is similar to (3–3) but includes more interaction terms. They used the nodewise regression method to estimate the Markov network; a sketch of this idea is given below. The semiparametric latent variable method works by introducing a latent Gaussian variable for each of the discrete variables and then estimating the Markov network using a regularization method. However, as stated in Fan et al. (2017), conditional independence between the latent variables does not imply conditional independence between the observed discrete variables.
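To fix ideas, the nodewise regression idea can be sketched in R with penalized regressions, e.g., via the glmnet package. This is a generic sketch under assumed inputs, not the implementation of Cheng et al. (2013); the function name nodewise_edges is ours.

    library(glmnet)

    # 'X' is assumed to be an n x p numeric matrix (binary columns coded 0/1).
    # An edge (i, j) is kept when Xj enters the selected model for node i;
    # for a binary node one would set fam = "binomial" instead.
    nodewise_edges <- function(X, fam = "gaussian") {
      p <- ncol(X); A <- matrix(0, p, p)
      for (i in 1:p) {
        fit  <- cv.glmnet(X[, -i], X[, i], family = fam)
        beta <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]  # drop intercept
        A[i, -i] <- as.numeric(beta != 0)
      }
      (A + t(A)) > 0    # symmetrize the selections by the OR rule
    }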
CHAPTER 4
CONCLUSIONS AND FUTURE RESEARCH
In this thesis we focused on approaches to solving graphical models for general types of data.
First, we presented a Poisson graphical model to construct gene regulatory networks based on next-generation sequencing data. Compared with the existing local Poisson graphical model, which suffers from inconsistency and thereby can only infer certain local structures of the network, the proposed method is consistent in the sense that the true gene regulatory network can be recovered from RNA-seq data when the sample size becomes large. We used a random effect model-based transformation to continuize NGS data; we then transformed the continuized data to Gaussian via a semiparametric transformation and applied an equivalent partial correlation selection method to reconstruct gene regulatory networks. Simulation results demonstrate that the proposed algorithm achieves higher reconstruction accuracy than LPGM, TPGM and SPGM.
The major contribution of the proposed method lies in the data-continuized transformation, which fills the theoretical gap of how to transform NGS data to continuous data and facilitates the learning of gene regulatory networks.

An avenue of further research in this direction would be the development of a general framework for integrating different types of omics data, such as RNA-seq and microarray data, to increase statistical power.
Second, we presented an independence-based approach to learning the structure of Bayesian networks for mixed types of data. The proposed method consists of three stages: it first learns the moral graph of the Bayesian network based on the techniques of variable screening and multiple hypothesis tests, then resolves the local structure of the Markov blanket for each variable based on conditional independence tests, and finally identifies the derived directions for non-convergent connections based on logical rules. The numerical results indicate that the proposed method performs significantly better than the existing methods, such as PC, grow-shrink, hill-climbing, and max-min hill-climbing. We also justified the consistency of the three-stage method under the small-n-large-p scenario. Further research can be conducted by employing a stochastic optimization method to derive directions for non-convergent edges instead of using logical rules.
A subject of future research is the extension of the Bayesian network approach to some other types of mixed data. For example, for non-Gaussian continuous random variables, the nonparanormal transformation proposed by Liu et al. (2009) can be applied to Gaussianize the data prior to applying the three-stage method. Since directions are included in a Bayesian network, it can represent more types of conditional independence than undirected graphical models. We expect that this method will be used in constructing gene regulatory networks from next-generation sequencing data in the future.
The advent of high-throughput techniques has steadily decreased the running cost of gene expression and other genomic feature profiling. With so much data available, how to conduct systematic studies of cancer genomes is of great importance. Graphical models are a very natural tool for learning associations among a large number of genomic features. The methods demonstrated in this thesis provide tools to construct both undirected and directed graphical models.
APPENDIX A
CONSISTENCY OF TRANSFORMATION-BASED METHOD
A.1 Proof of Lemma 1
We first work on the posterior mean of αi. For any ϵ > 0, the mean of the full conditional posterior distribution of αi is

E[αi | θij, βi, yi] = ∫_0^{ϵ/4} αi f(αi | θij, βi, yi) dαi + ∫_{ϵ/4}^{ϵ/2} αi f(αi | θij, βi, yi) dαi + ∫_{ϵ/2}^{∞} αi f(αi | θij, βi, yi) dαi ≤ ϵ/4 + (I1) + (I2).
It is easy to see that (I1) ≤ ϵ/2. To evaluate (I2), we rewrite f(αi | θij, βi, yi) = g(αi) e^{−b1 αi}, where g(αi) is an integrable function. Let m = min_{αi∈[ϵ/4, ϵ/2]} g(αi) and M = max_{αi∈[ϵ/2, ∞)} g(αi), which are known to take finite values. Then

(I2)/(I1) ≤ (M/m) ∫_{ϵ/2}^{∞} e^{−b1 αi} dαi / ∫_{ϵ/4}^{ϵ/2} e^{−b1 αi} dαi = (M/m) · 1/(e^{b1 ϵ/4} − 1) → 0,

as b1 → ∞. Therefore, E[αi | θij, βi, yi] → 0 as b1 → ∞.
Since βi | αi, θij, yi follows Gamma(n αi + a2, ∑_{j=1}^{n} θij + b2), we have E[βi | αi, θij, yi] → 0 as b2 → ∞. By the same argument, we have

E[θij | αi, βi, yi] = (yij + αi)/(βi + 1) → yij,

as b1 → ∞ and b2 → ∞. By the law of iterated expectations, we have E[θij | yi] → yij as b1 → ∞ and b2 → ∞.
A.2 Existing Theory of Adaptive MCMC
Since the prior hyperparameters change with iterations, the resulting posterior distribution also changes with iterations. Hence, the proposed sampling algorithm falls into the class of adaptive MCMC algorithms. For this type of adaptive MCMC algorithm, for which the target distribution changes with iterations, the ergodicity theory has been developed in Fort et al. (2011) and Liang et al. (2016). Here we adopt the theory developed by Liang et al. (2016).
To facilitate our study, we first define some notation for adaptive Markov chains. Consider a state space (X, F), where F = B(X) denotes the Borel σ-field defined on X. Let Xt ∈ X denote the state of the Markov chain at iteration t, and let Pγt denote the transition kernel at iteration t, where γt is a realization of a Y-valued random variable Γt. In simulations, γt is updated according to a specific rule. Let Gt = σ(X0, ..., Xt, Γ0, ..., Γt) be the filtration generated by {(Xi, Γi)}_{i=0}^{t}.

Let P_γ^t(x, B) = P(Xt ∈ B | X0 = x) denote the t-step transition probability for the Markov chain with the fixed transition kernel Pγ and the initial condition X0 = x. Let P^t((x, γ), B) = P(Xt ∈ B | X0 = x, Γ0 = γ), B ∈ F, denote the t-step transition probability for the adaptive Markov chain with initial conditions X0 = x and Γ0 = γ. Let

T(x, γ, t) = ‖P^t((x, γ), ·) − π(·)‖ = sup_{B∈F} |P^t((x, γ), B) − π(B)|

denote the total variation distance between the distribution of the adaptive Markov chain at time t and the target distribution π(·). The adaptive Markov chain is said to be ergodic if lim_{t→∞} T(x, γ, t) = 0 for all x ∈ X and γ ∈ Y.
For the proposed algorithm, since Γt = (b1^{(t)}, b2^{(t)}) takes values in a deterministic sequence, the ergodicity theory developed in Liang et al. (2016) can be re-stated as follows.

Theorem A.1. (Ergodicity; Liang et al. (2016)) Consider an adaptive Markov chain defined on the state space (X, F) with the adaptation index Γt ∈ Y. The adaptive Markov chain is ergodic if the following conditions are satisfied:

(a) (Stationarity) There exists a stationary distribution πγt(·) for each transition kernel Pγt, where γt denotes a realization of the random variable Γt.

(b) (Asymptotic Simultaneous Uniform Ergodicity) For any ϵ > 0, there exist constants K(ϵ) > 0 and N(ϵ) > 0 such that

sup_{x∈X} ‖P_{Γt}^n(x, ·) − π(·)‖ ≤ ϵ,

for all t > K(ϵ) and n > N(ϵ).

(c) (Diminishing Adaptation) Dt = sup_{x∈X} ‖P_{γt+1}(x, ·) − P_{γt}(x, ·)‖ → 0 as t → ∞.
Theorem A.2. (Weak Law of Large Numbers; Liang et al. (2016)) Consider an adaptive Markov chain defined on the state space (X, F). Suppose that conditions (a), (b) and (c) of Theorem A.1 hold. Let λ(·) be a bounded measurable function. Then

(1/n) ∑_{t=1}^{n} λ(Xt) → π(λ), in probability,

as n → ∞, where π(λ) = ∫_X λ(x) π(dx).
A.3 Proof of Lemma 2
Since the law of βi^{(t)} and the law of θij^{(t)} are completely determined by the law of αi^{(t)}, where the superscript t indicates the iteration number, our analysis concentrates on the convergence of αi^{(t)}. For notational simplicity, we rewrite b1^{(t)} as γt and rewrite f(αi | θij, βi, yij) as fγt(x) in what follows. For the proposed algorithm, γt takes values in a deterministic and monotone sequence as specified in Equation (2–5) of the main text.

Since the MH algorithm is used for simulating from fγt(x), condition (a) holds. As shown below, for the proposed algorithm, the posterior distribution π(·) converges to a Dirac delta measure. Hence, following from Theorem A.2, the posterior mean can be obtained by setting λ(x) to a truncated function: λ(x) = x if |x| < M and M otherwise, provided that M is large enough that the interval [−M, M] covers all the yij's. In summary, to prove Lemma 2, it suffices to verify conditions (b) and (c).
Verification of condition (c). Write the target density function as

fγt(x) = g(x) e^{−γt x},

where γt is the adaptive parameter taking the form

γt = γt−1 + c/t^ς, t = 1, 2, ...,
for some constants γ0 > 0, c > 0 and ς ∈ (0, 1]. Let q(x, y) = q(|y − x|) denote a random-walk proposal distribution. Define

sγ(x, y) = q(x, y) min{1, [g(y) e^{−γy} q(y, x)] / [g(x) e^{−γx} q(x, y)]},

and rγ(x, y) = sγ(x, y)/q(x, y). Then, for any Borel set B, the transition kernel is

Pγ(x, B) = ∫_B sγ(x, y) dy + I(x ∈ B) [1 − ∫_X sγ(x, z) dz].
For the derivative dsγ(x, y)/dγ, we have

|dsγ(x, y)/dγ| = |q(x, y) I(rγ(x, y) < 1) rγ(x, y) (y − x)| ≤ q(x, y) |y − x|.

By the mean-value theorem, there exists a constant c2 such that

∫_X |sγ(x, y) − sγ′(x, y)| dy ≤ c2 |γ′ − γ|,

as the proposal is a random-walk proposal. Therefore,

|Pγ(x, B) − Pγ′(x, B)| ≤ 2 c2 |γ − γ′|,

and

Dt = sup_{x∈X} ‖Pγt+1(x, ·) − Pγt(x, ·)‖ ≤ 2 c2 c / (t + 1)^ς → 0,

as t → ∞.
Verification of condition (b). Let P(x, B) denote the degenerate MH transition kernel for the Dirac delta measure π(x) = δ(x = 0), i.e., P(x, B) = 1 if 0 ∈ B, and P(x, B) = 0 otherwise. Then it is easy to see that sup_{x∈X} ‖Pγt(x, ·) − P(x, ·)‖ → 0 as t → ∞.
For any k ≥ 1 and any ψ : X → [−1, 1], we have

P_{γt}^k ψ(x0) − π(ψ) = S1(k) + S2(k),

where π(ψ) = ∫ ψ(x) π(x) dx, and

S1(k) = P^k ψ(x0) − π(ψ), S2(k) = P_{γt}^k ψ(x0) − P^k ψ(x0).

Since P(x, B) is degenerate, we have S1(k) = 0 for all k ≥ 1. For the term S2(k), we can further decompose it as follows: for any k0 (1 ≤ k0 < k),

|S2(k)| ≤ |P_{γt}^{k0} ψ(x0) − P^{k0} ψ(x0)| + |P_{γt}^k ψ(x0) − P_{γt}^{k0} ψ(x0)| + |P^k ψ(x0) − P^{k0} ψ(x0)|
        = |∑_{m=0}^{k0−1} [P^m P_{γt}^{k0−m} ψ(x0) − P^{m+1} P_{γt}^{k0−(m+1)} ψ(x0)]| + |P_{γt}^k ψ(x0) − P_{γt}^{k0} ψ(x0)| + |P^k ψ(x0) − P^{k0} ψ(x0)|
        = |∑_{m=0}^{k0−1} P^m (Pγt − P) P_{γt}^{k0−(m+1)} ψ(x0)| + |P_{γt}^k ψ(x0) − P_{γt}^{k0} ψ(x0)| + |P^k ψ(x0) − P^{k0} ψ(x0)|.

Since sup_x ‖Pγt(x, ·) − P(x, ·)‖ → 0 as t → ∞, for any ϵ > 0 there exists some L(ϵ) such that for any t > L(ϵ),

|S2(k)| ≤ 4 k0 ϵ + |P_{γt}^k ψ(x0) − P_{γt}^{k0} ψ(x0)| + |P^k ψ(x0) − P^{k0} ψ(x0)|
        = 4 k0 ϵ + S3(t, k, k0) + S4(k, k0).
Since P(x, B) is degenerate, we have S4(k, k0) = 0 for any k > k0 > 0. As shown above, γt forms a monotone and deterministic sequence. With such a deterministic sequence, Pγt converges faster and faster as t → ∞. Hence, there exist some K(ϵ) and L′(ϵ) such that for any k > k0 ≥ K(ϵ) and t ≥ L′(ϵ),

S3(t, k, k0) ≤ ϵ.

Let L̄(ϵ) = max{L(ϵ), L′(ϵ)}. Furthermore, one can choose K(ϵ) such that ϵ K(ϵ) → 0 as ϵ → 0.

Setting ϵ = ε/(4K(ε) + 1) and summarizing the results on S1(k) and S2(k), we conclude the following: for any ϵ > 0 and any x0 ∈ X, there exist L̄(ϵ) ∈ N and K(ϵ) ∈ N such that for any t > L̄(ϵ) and k > K(ϵ),

‖P_{γt}^k(x0, ·) − π(·)‖ ≤ ς,

where ς = (4K(ϵ) + 1)ϵ → 0 as ϵ → 0. Condition (b) is verified.
APPENDIX B
CONSISTENCY OF PROPOSED THREE-STAGE METHOD
This appendix establishes the consistency of the proposed three-stage method for learning the structure of Bayesian networks under the small-n-large-p scenario. It consists of two parts. The first part establishes the consistency of the moral graph learning algorithm, and the second part establishes the consistency of the v-structure identification algorithm.
B.1 Consistency of Moral Graph Learning
To indicate that p can grow as a function of n, we rewrite p as pn, rewrite the distribution P in (3–1) as P(n), and rewrite the true Bayesian network G as G(n) = (V(n), E(n)). Let G̃(n) = (Ṽ(n), Ẽ(n)) denote the marginal association network, where Ṽ(n) = V(n) and the association is measured by the coefficients of the marginal regression

Xi ∼ 1 + Xj, i, j = 1, 2, ..., pn, (B–1)

which can be a normal linear regression or a multiclass logistic regression depending on the type of Xi. Let γij denote the coefficient of Xj in (B–1), which is called the marginal regression coefficient (MRC) in this paper. Then we have

Ẽ(n) = {(i, j) : γij ≠ 0, i, j = 1, ..., pn}.

Let γ̂ij denote the estimate of γij, let νn denote a threshold value of the MRC, let Ê_νn denote the edge set of the network obtained through MRC thresholding at νn, and let Ê_νn,i denote the neighborhood of node i in Ê_νn. That is, we define

Ê_νn = {(i, j) : |γ̂ij| > νn}, and Ê_νn,i = {j : j ≠ i, |γ̂ij| > νn}. (B–2)
For convenience, we call the network with the edge set Ê_νn the thresholded MRC network. Similarly, we let βij denote the regression coefficient of Xj in the node-wise GLM

Xi ∼ 1 + Xj + ∑_{k∈V(n)\{i,j}} Xk. (B–3)
Following from the total conditioning property of Bayesian networks (Pellet & Elisseeff, 2008), which shows that Xj ∈ Si if and only if Xi and Xj are conditionally dependent given V \ {Xi, Xj}, we have βij ≠ 0 if and only if Xj ∈ Si. Let E(n)_mb = {(i, j) : βij ≠ 0, i, j = 1, ..., pn} denote the edge set of the moral graph. We partition E(n)_mb into two subsets, E(n)_p = {(i, j) : βij ≠ 0, γij ≠ 0} and E(n)_s = {(i, j) : βij ≠ 0, γij = 0}. The former set contains the parent-child links as well as the spouse links for which the two spouse variables are marginally dependent. The latter set contains the spouse links for which the two spouse variables are marginally independent, but dependent conditioned on their common child.
Let Zi = (1, Xi,1, ..., Xi,qn)′, where {Xi,1, ..., Xi,qn} ⊂ {X1, X2, ..., Xpn} \ {Xi}, and qn is bounded by O(n/log(n)). In this paper, qn is allowed to increase with n at an appropriate rate. The regression model Xi ∼ Zi is assumed, with quasi-likelihood function −l(Zi^T ξi, Xi), where ξi denotes the vector of regression coefficients. Let

ξi* = argmin_{ξi} E l(Zi^T ξi, Xi) (B–4)

be the population parameter, and let

ξ̂i = argmin_{ξi} Pn l(Zi^T ξi, Xi) (B–5)

be the maximum likelihood estimator (MLE), where Pn f(X, Y) = n^{−1} ∑_{i=1}^{n} f(Xi, Yi) is the empirical measure and

l(X; θ) = −[θX − b(θ) − log c(X)] (B–6)

denotes the log-density function (in the canonical form) of the exponential family, where b(·) and c(·) denote some known functions. Assume that ξi* is an interior point of a sufficiently large, compact and convex set F ⊂ R^{qn+1}. For any pair (Zi, Xi), the following conditions are assumed:
(E1) The Fisher information

I(ξi) = E{[∂l(Zi^T ξi, Xi)/∂ξi] [∂l(Zi^T ξi, Xi)/∂ξi]^T}

is finite and positive at ξi = ξi*. Moreover, ‖I(ξi)‖_F = sup_{ξi∈F, ‖z‖=1} ‖I(ξi)^{1/2} z‖ exists, where ‖·‖ is the Euclidean norm.
(E2) The function l(zi^T ξi, xi) satisfies the Lipschitz property with positive constant kn:

|l(zi^T ξi, xi) − l(zi^T ξi′, xi)| In(zi, xi) ≤ kn |zi^T ξi − zi^T ξi′| In(zi, xi),

for ξi, ξi′ ∈ F, where In(zi, xi) = I((zi, xi) ∈ Ωn) with Ωn = {(z, x) : ‖(z, x)‖∞ ≤ Kn} for some sufficiently large positive constant Kn, and ‖·‖∞ being the supremum norm. In addition, there exists a sufficiently large constant C such that, with bn = C kn Vn^{−1} (qn/n)^{1/2} and Vn given in condition (E3),

sup_{ξi∈F, ‖ξi−ξi*‖≤bn} |E[l(Zi^T ξi, Xi) − l(Zi^T ξi*, Xi)](1 − In(Zi, Xi))| ≤ o(qn/n).

(E3) The function l(Zi^T ξi, Xi) is convex in ξi, satisfying

E(l(Zi^T ξi, Xi) − l(Zi^T ξi*, Xi)) ≥ Vn ‖ξi − ξi*‖^2

for all ‖ξi − ξi*‖ ≤ bn and some positive constant Vn.
(E4) There exist some positive constants m0, m1, s0, s1 and α such that, for sufficiently large t,

P(|Xj| > t) ≤ (m1 − s1) exp{−m0 t^α}, j = 1, ..., pn,

and that

E exp{b(Z̄i^T ξ̄i + s0) − b(Z̄i^T ξ̄i)} + E exp{b(Z̄i^T ξ̄i − s0) − b(Z̄i^T ξ̄i)} ≤ s1,

where ξ̄i = {βij : βij ≠ 0, j ∈ P(i)}, P(i) = {j : (i, j) ∈ E(n)_p}, and Z̄i contains the corresponding predictors, that is, Z̄i^T ξ̄i = βi0 + ∑_{j∈P(i)} Xj βij.
(E5) The variance Var(Z̄i^T ξ̄i) is bounded from above and below, where Z̄i and ξ̄i are as specified in condition (E4).

(E6) Either b′′(·) is bounded or XM = (X1, ..., Xpn)^T follows an elliptically contoured distribution, that is,

XM = Σ^{1/2} R U,

and |E b′(Z̄i^T ξ̄i)(Z̄i^T ξ̄i − βi0)| is bounded, where U is uniformly distributed on the unit sphere in pn-dimensional Euclidean space, independently of the nonnegative random variable R, Σ = Var(XM), and λmax(Σ) = O(n^τ) for some constant 0 ≤ τ < 1 − 2κ, where κ is as defined in Condition (B) in Section 3.2.4.
Assumption (E6) implies that the largest eigenvalue of Σ is allowed to grow with n, but the growth rate is restricted; otherwise, the resulting thresholded correlation network can be dense. It follows from the definition of P(i) and Condition (D) of Section 3.2.4 that there exists a constant c2 such that

min_i min_{j∈P(i)} |γij| ≥ c2 n^{−κ}. (B–7)
Lemma 3 concerns the sure screening property of the thresholded association network; it follows from Theorem 4 of Fan et al. (2010).

Lemma 3. Suppose that conditions (A), (B), (E1)-(E4) hold.

(i) If Kn = o(n^{(1−2κ)/(α+2)}), then for any c3 > 0 there exists a positive constant c4 such that

P(max_{1≤i,j≤pn} |γ̂ij − γij| ≥ c3 n^{−κ}) ≤ O(pn^2 exp(−c4 n^{(1−2κ)α/(α+2)})) = o(1). (B–8)

(ii) If, in addition, condition (D) holds, then by taking νn = c5 n^{−κ} with 0 < c5 ≤ c2/2, we have

P(P(i) ⊆ Ê_νn,i) ≥ 1 − O(pn exp(−c4 n^{(1−2κ)α/(α+2)})) = 1 − o(1), (B–9)

P(E(n)_p ⊆ Ê_νn) ≥ 1 − O(pn^2 exp(−c4 n^{(1−2κ)α/(α+2)})) = 1 − o(1). (B–10)
Lemma 4. Suppose that conditions (A), (B), (E1)-(E6) hold. If Kn = o(n^{(1−2κ)/(α+2)}), then, for any νn = c5 n^{−κ}, we have

P(|Ê_νn,i| ≤ O(n^{2κ+τ})) ≥ 1 − O(pn exp(−c4 n^{(1−2κ)α/(α+2)})) = 1 − o(1). (B–11)

Since the exact value of 2κ + τ is unknown, we may bound the size of the neighborhood Ê_νn,i by O(n/log(n)) in practice. However, when n is large, n/log(n) can be too large. An excessively large size of the set will adversely affect the power of the moral graph screening tests. To address this issue, we propose a multiple hypothesis test-based procedure, i.e., step
(a)-(ii), for pre-identification of the nonzero marginal association measures. To justify this procedure, we have the following lemmas.

Lemma 5. Assume conditions (A), (B), (D), (E1)-(E4) hold. If ηn = (1/2) c2 n^{−κ}, where c2 is defined in (B–7), then

P[E(n)_p ⊂ Ê_ηn] = 1 − o(1), as n → ∞.
Proof. Let Aij denote the event that an error occurs when testing the hypotheses H0 : γij = 0 versus H1 : γij ≠ 0 for variables Xi and Xj. Let AIij and AIIij denote the false positive and false negative errors, respectively. Then Aij = AIij ∪ AIIij, where

False positive error AIij : |γ̂ij| > (c2/2) n^{−κ} and γij = 0,
False negative error AIIij : |γ̂ij| ≤ (c2/2) n^{−κ} and γij ≠ 0. (B–12)

By (B–7), minij |γij| ≥ c2 n^{−κ} for the links in E(n)_p. Therefore, by Lemma 3-(i),

P[missing a link of E(n)_p in Ê_ηn] ≤ P(max_{1≤i,j≤pn} |γ̂ij − γij| ≥ (c2/2) n^{−κ}) ≤ o(1), (B–13)

which concludes the proof.
Therefore, based on Lemmas 3, 4 and 5, we propose to restrict the size of the set Ai (in Algorithm 3.1) for each node to be

nsize = min{|Ê_ηn,i|, n/(cn1 log(n))}, (B–14)

where cn1 is a small constant, e.g., cn1 = 1, 2, or 3. The value of ηn can be determined through a simultaneous test of the hypotheses H0 : γij = 0 versus H1 : γij ≠ 0, 1 ≤ i < j ≤ pn, at a significance level of α1.
Lemma 6 concerns the convergence of the MLE of the regression coefficients when all the true predictors have been included. The lemma is a restatement of a theorem of Fan et al. (2010).

Lemma 6. Assume conditions (A), (B), (E1)-(E3) hold. If Kn = o(n^{(1−2κ)/(α+2)}), then, for any constant c7 > 0, there exists a constant c8 > 0 such that

P(max_{1≤i≤pn} ‖ξ̂i − ξi*‖ ≥ c7 n^{−κ}) ≤ O(pn exp(−c8 n^{(1−2κ)α/(α+2)})) = o(1), (B–15)

where ξi* is defined in (B–4) and ξ̂i is the MLE of ξi*.
Recall that if the Markov blanket Si (of node Xi) is contained in Zi, then ξ*_{i,k} = βij for j ∈ Si with Xj = Xi,k, and ξ*_{i,k} = 0 otherwise.

Let β̂ij denote the estimate of βij obtained in step (c) of Algorithm 3.1. Let ςn denote the threshold value of β̂ij, and let Ê_mb,ςn denote the network obtained through thresholding β̂ij. That is, we define

Ê_mb,ςn = {(i, j) : |β̂ij| > ςn}.

To establish the consistency of Ê_mb,ςn, we first note that, as implied by Condition (D) and the total conditioning property of Bayesian networks, there exists a constant c6 such that the true regression coefficients βij defined in (B–3) satisfy

min_i min_{j∈Si} |βij| ≥ c6 n^{−κ}, (B–16)

where κ is as defined in Condition (B). Let Ê* denote the edge set of a marginal association network for which each node has a degree of O(n/log(n)), being adjacent to its O(n/log(n)) most highly associated nodes. It follows from Lemmas 3 and 4 that

P[E(n)_p ⊆ Ê*] ≥ 1 − O(pn^2 exp(−c4 n^{(1−2κ)α/(α+2)})) = 1 − o(1). (B–17)

Let B = ∪_{i=1}^{pn} Bi, where Bi is defined in step (b) of the p-screening algorithm. We have E(n)_s ⊂ B. Further, by Lemma 5, we have E(n)_mb = E(n)_p ∪ E(n)_s ⊆ (Ê* ∩ Ê_ηn) ∪ B.

Lemma 7 establishes the consistency of Ê_mb,ςn as an estimate of E(n)_mb conditioned on E(n)_mb ⊆ (Ê* ∩ Ê_ηn) ∪ B. Its proof follows closely the proof of Lemma 5, based on (B–16), and is thus omitted here.
Lemma 7. Assume that conditions (A), (B), (C), (D) and (E1)-(E6) hold and that E(n)_mb ⊆ (Ê* ∩ Ê_ηn) ∪ B holds. Let ςn = (1/2) c6 n^{−κ}. If Kn = o(n^{(1−2κ)/(α+2)}), then

P[Ê_mb,ςn = E(n)_mb | E(n)_mb ⊆ (Ê* ∩ Ê_ηn) ∪ B] = 1 − o(1), as n → ∞.
As a summary of the above results, we have the following theorem, which establishes the consistency of Ê_mb,ςn as an estimate of the edge set of the moral graph E(n)_mb.

Theorem B.1. Consider a Bayesian network distribution P(n) defined in (3–1) for mixed GLM variables. Assume conditions (A), (B), (C), (D) and (E1)-(E6) hold. If Kn = o(n^{(1−2κ)/(α+2)}), then

P[Ê_mb,ςn = E(n)_mb] ≥ 1 − o(1), as n → ∞.

Proof. By invoking Lemma 5, (B–17), and Lemma 7, we have

P[Ê_mb,ςn = E(n)_mb] ≥ P[Ê_mb,ςn = E(n)_mb | E(n)_mb ⊆ (Ê* ∩ Ê_ηn) ∪ B] P[E(n)_mb ⊆ (Ê* ∩ Ê_ηn) ∪ B]
                    ≥ [1 − o(1)][1 − o(1) + 1 − o(1) − 1]
                    = 1 − o(1),

where the second factor is bounded using P(A ∩ B) ≥ P(A) + P(B) − 1 applied to the events established by Lemma 5 and (B–17). This concludes the proof.
B.2 Consistency of v-structure Identification

Since the consistency of the collider set algorithm has been proved by Pellet & Elisseeff (2008) for the low-dimensional problem, the algorithm is correct, and we only need to prove that the total errors, including both the type-I and type-II errors, of the conditional independence tests involved in the algorithm can be kept at a zero level in probability as n and p go to infinity.

Let Dij = {D : D ⊆ Si \ {j}} denote the set of all possible subsets of Si \ {j}. The cardinality of Dij is 2^{|Si|−1}, which is upper bounded by 2^{m−1}, where m denotes the upper bound of the Markov blanket size. Let F = {D ∈ Dij : Xi and Xj are conditionally dependent given XD}. Then, it follows from
Condition (D) that there exists a constant c9 > 0 such that

min_i min_{j∈Si} min_{D∈F} β_{ij|D} ≥ c9 n^{−κ}, (B–18)

where β_{ij|D} denotes the regression coefficient defined in (3–6). Let λn denote the critical value of the test of the hypotheses H0 : β_{ij|D} = 0 versus H1 : β_{ij|D} ≠ 0. Then a v-structure can be identified if we find a set D ∈ Dij such that β̂_{ij|D} < λn. Let Ê(n)_v,λn denote the set of v-structures identified with the critical value λn.
Theorem B.2. Consider a Bayesian network with distribution P(n) defined in (3–1) for mixed GLM variables. Assume conditions (A), (B), (C), (D) and (E1)-(E3) hold. Let λn = (c9/2) n^{−κ}. If Kn = o(n^{(1−2κ)/(α+2)}), then

P[Ê(n)_v,λn = E(n)_v | Ê_mb,ςn = E(n)_mb] ≥ 1 − o(1), as n → ∞.
Proof. Let A_{ij|D} denote the event that an error occurs when testing the hypothesis H0 : β_{ij|D} = 0 versus H1 : β_{ij|D} ≠ 0 for variables Xi and Xj. Let AI_{ij|D} and AII_{ij|D} denote the false positive and false negative errors, respectively. Then A_{ij|D} = AI_{ij|D} ∪ AII_{ij|D}, where

False positive error AI_{ij|D} : β̂_{ij|D} > (c9/2) n^{−κ} and β_{ij|D} = 0,
False negative error AII_{ij|D} : β̂_{ij|D} ≤ (c9/2) n^{−κ} and β_{ij|D} ≠ 0. (B–19)

By (B–18), we have min_i min_{j∈Si} min_{D∈F} β_{ij|D} ≥ c9 n^{−κ}. Therefore, by Lemma 6,

P[Ê(n)_v,λn ≠ E(n)_v | Ê_mb,ςn = E(n)_mb] ≤ P(max_{1≤i≤pn, j∈Si, D∈Dij} |β̂_{ij|D} − β_{ij|D}| ≥ (c9/2) n^{−κ})
                                         ≤ O(pn m 2^{m−1} exp(−c8 n^{(1−2κ)α/(α+2)})), (B–20)

where m = n^b with b as defined in condition (C) of Section 3.2.4. This concludes the proof.
REFERENCES
Aguiar, M., Masse, R. & Gibbs, B. F. (2005). Regulation of cytochrome p450 by posttranslational modification. Drug Metabolism Reviews 37, 379–404.
Aliferis, C. F., Statnikov, A., Tsamardinos, I., Mani, S. & Koutsoukos, X. D. (2010). Local causal and markov blanket induction for causal discovery and feature selection for classification part i: Algorithms and empirical evaluation. Journal of Machine Learning Research 11, 171–234.
Allen, G. I. & Liu, Z. (2013). A local poisson graphical model for inferring networks from sequencing data. IEEE Transactions on NanoBioscience 12, 189–198.
Anders, S. & Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology 11, R106.
Barabasi, A.-L. & Albert, R. (1999). Emergence of scaling in random networks. Science 286, 509–512.
Benjamini, Y., Krieger, A. M. & Yekutieli, D. (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 491–507.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 192–236.
Bouckaert, R. R. (2001). Bayesian belief networks: from construction to inference.
Cantile, M., Scognamiglio, G., Anniciello, A., Farina, M., Gentilcore, G., Santonastaso, C., Fulciniti, F., Cillo, C., Franco, R., Ascierto, P. A. et al. (2012). Increased hox c13 expression in metastatic melanoma progression. Journal of Translational Medicine 10, 91.
Carvalho, R. H., Haberle, V., Hou, J., van Gent, T., Thongjuea, S., van IJcken, W., Kockx, C., Brouwer, R., Rijkers, E., Sieuwerts, A. et al. (2012). Genome-wide dna methylation profiling of non-small cell lung carcinomas. Epigenetics & Chromatin 5, 9.
Cheng, J., Li, T., Levina, E. & Zhu, J. (2013). High-dimensional mixed graphical models. arXiv:1304.2810.
Chickering, D. M. (1996). Learning bayesian networks is np-complete. In Learning from Data. Springer, pp. 121–130.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research 3, 507–554.
Colombo, D., Maathuis, M. H., Kalisch, M. & Richardson, T. S. (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, 294–321.
Danaher, P., Wang, P. & Witten, D. M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76, 373–397.
De Montellano, P. R. O. (2005). Cytochrome P450: Structure, Mechanism, and Biochemistry. Springer Science & Business Media.
DeKelver, R. C., Lewin, B., Lam, K., Komeno, Y., Yan, M., Rundle, C., Lo, M.-C. & Zhang, D.-E. (2013). Cooperation between runx1-eto9a and novel transcriptional partner klf6 in upregulation of alox5 in acute myeloid leukemia. PLoS Genetics 9, e1003765.
Dempster, A. P. (1972). Covariance selection. Biometrics, 157–175.
Dobra, A., Lenkoski, A. et al. (2011). Copula gaussian graphical models and their application to modeling functional disability data. The Annals of Applied Statistics 5, 969–993.
Fan, J., Liu, H., Ning, Y. & Zou, H. (2017). High dimensional semiparametric latent graphical model for mixed data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 405–421.
Fan, J. & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 849–911.
Fan, J., Song, R. et al. (2010). Sure independence screening in generalized linear models with np-dimensionality. The Annals of Statistics 38, 3567–3604.
Faner, R., Cruz, T., Casserras, T., Lopez-Giraldo, A., Noell, G., Coca, I., Tal-Singer, R., Miller, B., Rodriguez-Roisin, R., Spira, A. et al. (2016). Network analysis of lung transcriptomics reveals a distinct b-cell signature in emphysema. American Journal of Respiratory and Critical Care Medicine 193, 1242–1253.
Feng, Q., Sekula, D., Guo, Y., Liu, X., Black, C. C., Galimberti, F., Shah, S. J., Sempere, L. F., Memoli, V., Andersen, J. B. et al. (2008). Ube1l causes lung cancer growth suppression by targeting cyclin d1. Molecular Cancer Therapeutics 7, 3780–3788.
Fort, G., Moulines, E. & Priouret, P. (2011). Convergence of adaptive and interacting markov chain monte carlo algorithms. The Annals of Statistics, 3262–3289.
Friedman, J., Hastie, T. & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.
Gallopin, M., Rau, A. & Jaffrezic, F. (2013). A hierarchical poisson log-normal model for network inference from rna sequencing data. PLoS ONE 8, e77503.
Gharabaghi, M. A. (2016). Diagnostic investigation of birc6 and sirt1 protein expression level as potential prognostic biomarkers in patients with non-small cell lung cancer. The Clinical Respiratory Journal.
Ha, M. J., Sun, W. & Xie, J. (2015). Penpc: A two-step approach to estimate the skeletons of high-dimensional directed acyclic graphs. Biometrics.
Hammersley, J. M. & Clifford, P. (1971). Markov fields on finite graphs and lattices.
Harris, N. & Drton, M. (2013). Pc algorithm for nonparanormal graphical models. Journal of Machine Learning Research 14, 3365–3383.
Heckerman, D., Geiger, D. & Chickering, D. M. (1995). Learning bayesian networks: The combination of knowledge and statistical data. Machine Learning 20, 197–243.
Herskovits, E. H. & Cooper, G. F. (2013). Kutato: An entropy-driven system for construction of probabilistic expert systems from databases. arXiv preprint arXiv:1304.1088.
Hoff, P. D. (2007). Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics, 265–283.
Humbert, M., Halter, V., Shan, D., Laedrach, J., Leibundgut, E. O., Baerlocher, G. M., Tobler, A., Fey, M. F. & Tschan, M. P. (2011). Deregulated expression of kruppel-like factors in acute myeloid leukemia. Leukemia Research 35, 909–913.
Inouye, D. I., Ravikumar, P. & Dhillon, I. S. (2016). Square root graphical models: Multivariate generalizations of univariate exponential families that permit positive dependencies. In JMLR Workshop and Conference Proceedings, vol. 48. NIH Public Access.
Jia, B., Xu, S., Xiao, G., Lamba, V. & Liang, F. (2017). Learning gene regulatory networks from next generation sequencing data. Biometrics.
Kalisch, M. & Buhlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the pc-algorithm. Journal of Machine Learning Research 8, 613–636.
Kirkpatrick, S., Gelatt, C. D., Vecchi, M. P. et al. (1983). Optimization by simulated annealing. Science 220, 671–680.
Kjaerulff, U. B. & Madsen, A. L. (2008). Bayesian networks and influence diagrams. Springer Science+Business Media 200, 114.
Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer.
Lam, W. & Bacchus, F. (1994). Learning bayesian belief networks: An approach based on the mdl principle. Computational Intelligence 10, 269–293.
Lauritzen, S. L. (1996). Graphical Models, vol. 17. Clarendon Press.
Lee, J. & Hastie, T. (2013). Structure learning of mixed graphical models. In Artificial Intelligence and Statistics.
Lee, J. D. & Hastie, T. J. (2015). Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics 24, 230–253.
Liang, F., Cheng, Y. & Lin, G. (2014). Simulated stochastic approximation annealing for global optimization with a square-root cooling schedule. Journal of the American Statistical Association 109, 847–863.
Liang, F., Jin, I. H., Song, Q. & Liu, J. S. (2016). An adaptive exchange algorithm for sampling from distributions with intractable normalizing constants. Journal of the American Statistical Association 111, 377–393.
Liang, F., Song, Q. & Qiu, P. (2015). An equivalent measure of partial correlation coefficients for high-dimensional gaussian graphical models. Journal of the American Statistical Association 110, 1248–1265.
Liang, F. & Zhang, J. (2008). Estimating the false discovery rate using the stochastic approximation algorithm. Biometrika.
Liu, H., Lafferty, J. & Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research 10, 2295–2328.
Liu, P., Morrison, C., Wang, L., Xiong, D., Vedell, P., Cui, P., Hua, X., Ding, F., Lu, Y., James, M. et al. (2012). Identification of somatic mutations in non-small cell lung carcinomas using whole-exome sequencing. Carcinogenesis 33, 1270–1276.
Magrassi, L., Conti, L., Lanterna, A., Zuccato, C., Marchionni, M., Cassini, P., Arienta, C. & Cattaneo, E. (2005). Shc3 affects human high-grade astrocytomas survival. Oncogene 24, 5198–5206.
Margaritis, D. (2003). Learning Bayesian Network Model Structure from Data. Ph.D. thesis, US Army.
Margaritis, D. & Thrun, S. (1999). Bayesian network induction via local neighborhoods. Tech. rep., DTIC Document.
Mazumder, R. & Hastie, T. (2012). The graphical lasso: New insights and alternatives. Electronic Journal of Statistics 6, 2125.
McGeachie, M. J., Chang, H.-H. & Weiss, S. T. (2014). Cgbayesnets: conditional gaussian bayesian network learning and inference with mixed discrete and continuous data. PLoS Computational Biology 10, e1003676.
Meek, C. (1995). Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.
Meinshausen, N. & Buhlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 1436–1462.
Mizuno, H., Kitada, K., Nakai, K. & Sarai, A. (2009). Prognoscan: a new database for meta-analysis of the prognostic value of genes. BMC Medical Genomics 2, 18.
Mogi, A. & Kuwano, H. (2011). Tp53 mutations in nonsmall cell lung cancer. BioMed Research International 2011.
Muller, P. (1992). Alternatives to the gibbs sampling scheme.
Nelson, D. R., Koymans, L., Kamataki, T., Stegeman, J. J., Feyereisen, R., Waxman, D. J., Waterman, M. R., Gotoh, O., Coon, M. J., Estabrook, R. W. et al. (1996). P450 superfamily: update on new sequences, gene mapping, accession numbers and nomenclature. Pharmacogenetics and Genomics 6, 1–42.
Nielsen, T. D. & Jensen, F. V. (2009). Bayesian Networks and Decision Graphs. Springer Science & Business Media.
Patil, G. P., Joshi, S. W. & Rao, C. R. (1968). A Dictionary and Bibliography of Discrete Distributions. International Statistical Institute.
Pearl, J. (2014). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Pearl, J. & Verma, T. S. (1995). A theory of inferred causation. Studies in Logic and the Foundations of Mathematics 134, 789–811.
Pellet, J.-P. & Elisseeff, A. (2008). Using markov blankets for causal structure learning. Journal of Machine Learning Research 9, 1295–1342.
Plant, N. (2007). The human cytochrome p450 sub-family: transcriptional regulation, inter-individual variation and interaction networks. Biochimica et Biophysica Acta (BBA) - General Subjects 1770, 478–488.
Preetam, N., Alain, H. & Maathuis, M. H. (2016). High-dimensional consistency in score-based and hybrid structure learning. arXiv:1507.02608.
Ravikumar, P., Wainwright, M. J., Lafferty, J. D. et al. (2010). High-dimensional ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics 38, 1287–1319.
Robinson, M. D. & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of rna-seq data. Genome Biology 11, R25.
Scutari, M. & Denis, J.-B. (2014). Bayesian Networks: with Examples in R. CRC Press.
Spirtes, P. (2010). Introduction to causal inference. Journal of Machine Learning Research 11, 1643–1662.
Spirtes, P., Glymour, C. N. & Scheines, R. (2000). Causation, Prediction, and Search. MIT Press.
Sultan, M., Schulz, M. H., Richard, H., Magen, A., Klingenhoff, A., Scherf, M., Seifert, M., Borodina, T., Soldatov, A., Parkhomchuk, D. et al. (2008). A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321, 956–960.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.
Tsamardinos, I., Aliferis, C. F. & Statnikov, A. (2003a). Time and sample efficient discovery of markov blankets and direct causal relations. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
Tsamardinos, I., Aliferis, C. F., Statnikov, A. R. & Statnikov, E. (2003b). Algorithms for large scale markov blanket discovery. In FLAIRS Conference, vol. 2.
Tsamardinos, I., Brown, L. E. & Aliferis, C. F. (2006). The max-min hill-climbing bayesian network structure learning algorithm. Machine Learning 65, 31–78.
Verma, T. S. & Pearl, J. (1991). Equivalence and synthesis of causal models. In Uncertainty in Artificial Intelligence, vol. 6.
Wan, Y.-W., Allen, G. I., Baker, Y., Yang, E., Ravikumar, P. & Liu, Z. (2015). Package "huge": High-dimensional undirected graph estimation.
Wang, W., Baladandayuthapani, V., Morris, J. S., Broom, B. M., Manyam, G. & Do, K.-A. (2013). ibag: integrative bayesian analysis of high-dimensional multiplatform genomics data. Bioinformatics 29, 149–159.
Yahav, I. & Shmueli, G. (2012). On generating multivariate poisson data in management science applications. Applied Stochastic Models in Business and Industry 28, 91–102.
Yang, E., Allen, G., Liu, Z. & Ravikumar, P. K. (2012). Graphical models via generalized linear models. In Advances in Neural Information Processing Systems.
Yang, E., Ravikumar, P. K., Allen, G. I. & Liu, Z. (2013). On poisson graphical models. In Advances in Neural Information Processing Systems.
Yang, X., Zhang, B., Molony, C., Chudin, E., Hao, K., Zhu, J., Gaedigk, A., Suver, C., Zhong, H., Leeder, J. S. et al. (2010). Systematic genetic and genomic analysis of cytochrome p450 enzyme activities in human liver. Genome Research 20, 1020–1036.
Yaramakala, S. & Margaritis, D. (2005). Speculative markov blanket discovery for optimal feature selection. In Data Mining, Fifth IEEE International Conference on. IEEE.
Yuan, M. & Lin, Y. (2007). Model selection and estimation in the gaussian graphical model. Biometrika 94, 19–35.
Zhao, J., Li, P., Feng, H., Wang, P., Zong, Y., Ma, J., Zhang, Z., Chen, X., Zheng, M., Zhu, Z. et al. (2013). Cadherin-12 contributes to tumorigenicity in colorectal cancer by promoting migration, invasion, adhesion and angiogenesis. Journal of Translational Medicine 11, 288.
Zhao, T., Li, X., Liu, H., Roeder, K. & Larry, J. L. (2015). Package "huge": High-dimensional undirected graph estimation.
Zhou, H., Brekman, A., Zuo, W.-L., Ou, X., Shaykhiev, R., Agosto-Perez, F. J., Wang, R., Walters, M. S., Salit, J., Strulovici-Barel, Y. et al. (2016). Pou2af1 functions in the human airway epithelium to regulate expression of host defense genes. The Journal of Immunology 196, 3159–3167.
BIOGRAPHICAL SKETCH
Suwa Xu was born in Yixing, Jiangsu, China. She attended Yixing High School in 2004 and was accepted into the statistics program at South China Agricultural University in 2007.

Suwa Xu graduated in 2011 with a bachelor's degree in statistics and was later accepted into the Department of Statistics at Rice University as a master's student.

After obtaining her master's degree, Suwa Xu came to Gainesville for further study as a Ph.D. student in biostatistics at the University of Florida. She received her Ph.D. from the University of Florida in the summer of 2017.