LEARNING HIGH-DIMENSIONAL GRAPHICAL MODELS FOR GENERAL TYPES OF RANDOM VARIABLES
By
SUWA XU
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2017
© 2017 Suwa Xu
I dedicate this to everyone that helped me
ACKNOWLEDGMENTS
Completing this dissertation would not have been possible without the support from the
people that have helped me remain focused, motivated and inspired throughout the years. I am
extremely fortunate to be surrounded by such amazing people.
First I would like to thank my advisor, Professor Faming Liang, for his support throughout
the duration of my time as a graduate student at UF. His wisdom and generosity will always
inspire me. Also, his cheerfulness and encouragement have been essential in giving me the
space that allowed me to discover myself in the field of biostatistics.
I owe thanks to all of my committee members, Professor Yang Yang, Professor Fei Zou
and Professor Samuel Wong for their useful and constructive comments and advice.
I would also like to thank my friends and fellow graduate students at UF for their
company. I cannot imagine what my life is going to be without them.
Last but not least I would like to thank my family, my Mom, Dad and grandparents.
Without their support, I would never have been here for my PhD study.
TABLE OF CONTENTS
ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 Graphical Models
      1.1.1 Introduction to Graphical Models
      1.1.2 Graph Notation and Terminology
      1.1.3 Conditional Independence
   1.2 Markov Network and Markov Properties
   1.3 Bayesian Network
      1.3.1 Introduction to Bayesian Network
      1.3.2 Constraint-based Approaches
      1.3.3 Score-based Approaches
      1.3.4 Hybrid Approaches
   1.4 ψ-learning Algorithm for Learning Gaussian Graphical Models

2 UNDIRECTED GRAPHICAL MODEL FOR COUNT DATA
   2.1 RNA-seq Data and Poisson Graphical Models
   2.2 Method
      2.2.1 Data-Continuized Transformation
      2.2.2 Data-Gaussianized Transformation
      2.2.3 Consistency
   2.3 Simulation Studies
   2.4 Real Data Examples
      2.4.1 Liver Cytochrome P450s Subnetwork
      2.4.2 Acute Myeloid Leukemia mRNA Sequencing Network
   2.5 Discussion

3 BAYESIAN NETWORKS FOR MIXED DATA
   3.1 Introduction
   3.2 A Brief Review of Bayesian Network Theory
   3.3 Learning High-Dimensional Bayesian Networks
      3.3.1 Learning the Moral Graph
      3.3.2 Identifying v-structures
      3.3.3 Identifying Derived Directions
      3.3.4 Consistency of the Proposed Method
   3.4 Simulation Studies
      3.4.1 Mixed Data for an Undirected Graph
      3.4.2 Mixed Data for a Directed Graph
   3.5 Real Data Analysis
      3.5.1 Lung Cancer Genetic Network
      3.5.2 Glioblastoma Genetic Network with Methylation Adjustment
   3.6 Discussion

4 CONCLUSIONS AND FUTURE RESEARCH

APPENDIX

A CONSISTENCY OF TRANSFORMATION-BASED METHOD
   A.1 Proof of Lemma 1
   A.2 Existing Theory of Adaptive MCMC
   A.3 Proof of Lemma 2

B CONSISTENCY OF PROPOSED THREE-STAGE METHOD
   B.1 Consistency of Moral Graph Learning
   B.2 Consistency of v-structure Identification

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES
1-1 Conditional independences represented by Markov networks.

2-1 The posterior mean and standard deviation of αi, βi and θij for one simulated variable, where a1 = a2 = a and b1^(0) = b2^(0) = b^(0).

3-1 Outcomes of binary decision.

3-2 Average areas under the precision-recall curves produced by the three-stage and pseudo-likelihood methods. The number in parentheses represents the standard deviation of the areas averaged over 10 datasets.

3-3 Average precision and recall of the directed graphs produced by the three-stage, PC, HC and MMHC algorithms. The number in parentheses represents the standard deviation of the values averaged over 10 datasets.
LIST OF FIGURES
1-1 An example of a graphical model. Each arrow indicates a dependency. In this example: I depends on J, J depends on I, J depends on K and K depends on I.

1-2 Illustrative plot for the calculation of ψ-partial correlation coefficients, where the solid and dotted edges indicate the direct and indirect associations, respectively. The left and right shaded ellipses cover, respectively, the reduced neighborhoods of node i and node j in the correlation graph.

2-1 Left: scatter plot of the continuized data versus raw counts for one variable. Right: QQ-plot of the Gaussianized data for one continuized variable.

2-2 Precision-recall curves produced by the proposed method (Cont+NPN+ψ-learning), log-transformation-based ψ-learning (Log+NPN+ψ-learning), log-transformation-based gLasso (Log+NPN+gLasso), log-transformation-based nodewise regression (Log+NPN+nodewise regression), LPGM, SPGM, and TPGM for the simulated data with (n, p) = (100, 200).

2-3 Precision-recall curves of each method for different types of structures with (n, p) = (100, 200). Upper left: hub; upper right: scale-free; lower left: small-world; lower right: random.

2-4 Precision-recall curves of each method for different types of structures with (n, p) = (500, 200). Upper left: hub; upper right: scale-free; lower left: small-world; lower right: random.

2-5 Left: P450 gene regulatory subnetwork produced from Yang et al. (2010), where the known regulators and P450 genes are shown as blue rectangles and red ovals, respectively. Right: the subnetwork produced by the proposed method.

2-6 GRN produced by the proposed method for the AML RNA-seq data with (n, p) = (179, 500).

2-7 Log-log plots of the degree distributions of the four networks generated by the proposed method (upper left), gLasso (upper right), nodewise regression (lower left), and LPGM (lower right).

3-1 BMP format drawing. Note: no filetype is designated by adding an extension.

3-2 A smaller version of the graph structure underlying the simulation study, where the circle nodes represent Gaussian variables, the square nodes represent Bernoulli variables, and the solid, dotted and dashed lines represent three different types of edges.

3-3 Precision-recall curves produced by the three-stage and pseudo-likelihood methods for two mixed datasets: the left generated under the setting (n, pc, pd) = (500, 100, 100) and the right under the setting (n, pc, pd) = (100, 100, 100).

3-4 The true directed network for a dataset with n = 3000 samples.

3-5 The estimated directed network for a dataset with n = 3000 samples.

3-6 The Bayesian network produced by the three-stage method with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC (Lung Squamous Cell Carcinoma) samples.

3-7 The Bayesian network produced by MMHC with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC samples.

3-8 The Bayesian network produced by HC with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC samples.

3-9 Directed Glioblastoma genetic network learned by the three-stage method with methylation effects having been adjusted.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

LEARNING HIGH-DIMENSIONAL GRAPHICAL MODELS FOR GENERAL TYPES OF RANDOM VARIABLES
By
Suwa Xu
August 2017
Chair: Faming Liang
Major: Biostatistics
Graphical models have recently become a popular tool to study conditional independence
relationships among a large number of variables. There are two branches of graphical models,
Bayesian networks and Markov networks, among which Bayesian networks are directed acyclic
graphs while Markov networks do not contain direction information. In this thesis we consider
learning associations for general types of random variables, e.g., count data and mixed
data. The existing method for count data is the Poisson graphical model; however, it is not
consistent and can only infer certain local structures of the network. Meanwhile, the network
structure is difficult to learn when the number of variables p is greater than the sample
size n. Moreover, in practice, the existing methods for learning Bayesian networks work
primarily with discrete data sets. The contributions of this thesis include a transformation-based
algorithm for constructing networks from count data and a three-stage method for
learning Bayesian networks from mixed data under the small-n-large-p scenario. The numerical
results indicate that the proposed methods significantly outperform the existing methods. The
proposed methods can be used to construct genetic networks from different types of genomic data,
such as microarray, RNA-seq and mutation data.
CHAPTER 1
INTRODUCTION
1.1 Graphical Models
1.1.1 Introduction to Graphical Models
Graphical models have recently become a popular tool to study association networks
for a large number of variables, where the variables can refer to genes, proteins, SNPs, or any
other subjects depending on the problem under study. Generally speaking, a graphical model
uses a graph-based representation as the foundation for encoding a complete distribution over
a multi-dimensional space; the graph is a compact, factorized representation of a set
of independences that hold in the specific distribution. One can think of graphical models
as a marriage between graph theory and probability theory. Figure 1-1 shows an example of
a graphical model.
A graph G consists of a set of vertices V and a set of edges E joining some pairs of the
vertices. In a graphical model, each vertex represents a random variable, and the graph gives a
visual representation of the joint distribution of the entire set of random variables. If the graph
has only undirected edges, it is an undirected graph, also known as a Markov random field or
Markov network. In these graphs, the absence of an edge between two vertices indicates that
the corresponding variables are conditionally independent given the other variables. If all edges
are directed, the graph is said to be directed. There is an active literature on directed graphical
models or Bayesian networks; these are graphical models in which the edges have directional
arrows (but no directed cycles).

Figure 1-1. An example of a graphical model. Each arrow indicates a dependency. In this example: I depends on J, J depends on I, J depends on K and K depends on I.

Directed graphical models represent probability distributions that can be factored into products of
conditional distributions, and have the potential for causal inference. Both families of graphical
models encompass the properties of factorization and independence, but they differ in the set
of independences they can encode and the factorization of the distribution that they induce.
1.1.2 Graph Notation and Terminology
A graph is called a complete graph if each pair of vertices is connected by an edge.
A subset is complete if it induces a complete subgraph. If there is an arrow from vertex u
pointing towards vertex v , u is said to be a parent of v and v a child of u. The set of parents
of v is denoted as Pa(v) and the set of children of u as Pa(u).
Two vertices u and v are called adjacent if there is an edge joining them; this is denoted
by u ∼ v . If there is no edge between u and v , i.e., u ∼ v , then u and v are said to be
non-adjacent. The set of neighbors of vertex u is denoted as ne(u). A path < v1, ..., vn >
from v1 to vn of an undirected graph, G = (V ,E), is blocked by a set S ⊆ V if v2, ..., vn−1∩
S = ∅. There is similar concept for paths of acyclic, directed graph, but the definition is
based on d-separation (will discuss later). A graph G = (V ,E) is connected if for any pair
u, v ⊂ V , there is a path u, ..., v in G. A connected graph G = (V ,E) is a tree if for any
pair u, v ⊂ V , there is a unique path < u, ..., v > in G.
A cycle is a path, < u, ..., v >, of length greater than two with the exception that u = v ;
a directed cycle is defined in the obvious way. A directed graph with no directed cycles is called
an acyclic, directed graph or simply a DAG.
1.1.3 Conditional Independence
If X , Y , Z are random variables with a joint distribution P, we say that X is conditionally
independent of Y given Z under P, and write X ⊥ Y |Z [P], if, for any measurable set A in the
sample space of X , there exists a version of the conditional probability P(A|Y ,Z) which is a
function of Z alone. Usually P will be fixed and omitted from the notation.
When X , Y and Z are discrete random variables the condition for X ⊥ Y |Z can be
written as
P(X = x ,Y = y |Z = z) = P(X = x |Z = z)P(Y = y |Z = z),
where the equation holds for all z with P(Z = z) > 0. When the three variables admit a joint
density with respect to a product measure µ, we have
X ⊥ Y |Z ⇔ fXY |Z(x , y |z) = fX |Z(x |z)fY |Z(y |z),
where this equation is to hold almost surely with respect to P.
The conditional relation X ⊥ Y |Z has the following properties, where f denotes an
arbitrary measurable function on the sample space of X :
(a) if X ⊥ Y |Z then Y ⊥ X |Z ;
(b) if X ⊥ Y |Z and U = f (X ), then U ⊥ Y |Z ;
(c) if X ⊥ Y |Z and U = f (X ), then X ⊥ Y |(Z ,U);
(d) if X ⊥ Y |Z and X ⊥W |(Y ,Z), then X ⊥ (W ,Y )|Z .
Another property of the conditional independence relation is,
(e) if X ⊥ Y |Z and X ⊥ Z |Y then X ⊥ (Y ,Z).
However, property (e) does not hold in general, but only under additional conditions,
for example when the joint density of all variables with respect to a product measure is positive
and continuous.
A semi-graphoid is an algebraic structure which satisfies (a)-(d) where X , Y , Z are
disjoint subsets of a finite set and U = f (X ) is replaced by U ⊂ X (Pearl, 2014). If property
(e) also holds for disjoint subsets, it is called a graphoid.
A very important example of a model for the conditional independence properties is that
of graph separation in an undirected graph. Let A, B and C be subsets of the vertex set V
of a finite undirected graph G = (V ,E). Define
A ⊥g B|C ⇔ C separates A from B in G.
The graph separation has the following properties:

(a*) if A ⊥g B|C then B ⊥g A|C ;

(b*) if A ⊥g B|C and U is a subset of A, then U ⊥g B|C ;

(c*) if A ⊥g B|C and U is a subset of B, then A ⊥g B|(C ∪ U);

(d*) if A ⊥g B|C and A ⊥g D|(B ∪ C), then A ⊥g (B ∪ D)|C .

Even the analogue of (e) holds when all the involved subsets are disjoint. Therefore graph
separation satisfies the graphoid axioms.
1.2 Markov Network and Markov Properties
A Markov network, also known as a Markov random field, is a model over an undirected graph.
Associated with an undirected graph G = (V ,E) and a collection of random variables
(Xi)i∈V, there are three different types of Markov properties.

A probability measure P on X is said to obey

(P) the pairwise Markov property, relative to G, if for any pair (Xi ,Xj) of non-adjacent vertices, Xi ⊥ Xj |XV\{i,j};

(L) the local Markov property, relative to G, if for any vertex i ∈ V, Xi ⊥ XV\(ne(i)∪{i})|Xne(i), where ne(i) is the set of neighbors of i ;

(G) the global Markov property, relative to G, if for any triplet (A,B,S) of disjoint subsets of V such that S separates A from B in G, XA ⊥ XB |XS .
The above three Markov properties are not equivalent: the global Markov property is
stronger than the local Markov property, which in turn is stronger than the pairwise one. If
it holds for all disjoint subsets A, B, C and D that A ⊥ B|(C ∪ D) and A ⊥ C |(B ∪ D)
imply A ⊥ (B ∪ C)|D, then the three Markov properties are all equivalent. This condition is an
analogue of (e) and holds, for example, if P has a positive and continuous density with respect
to a product measure µ; in that case, the sets of graphs with associated probability distributions
that satisfy the pairwise, local and global Markov properties are the same. For example, consider
the undirected graphical model (Markov network) given by the chain A − B − C − D. By the
pairwise Markov property, A ⊥ D|(B ∪ C)
since there is no edge between A and D. But B also separates A from C and D, and therefore
by the global Markov property we conclude that A ⊥ C |B and A ⊥ D|B. Similarly, we have
B ⊥ D|C .
The global Markov property allows us to decompose a graph into smaller pieces, which
makes the graph easier to manage. For this purpose, we separate the graph into cliques.
A clique is a complete subgraph. It is maximal if no other vertex can be
added to it while still yielding a clique. For example, the maximal cliques of A − B − C − D are
{A,B}, {B,C} and {C ,D}.
Given a set of random variables X = (Xi)i∈V, let P(X = x) be the probability of a
particular field configuration x of X ; that is, P(X = x) is the probability of finding that the
random variables X take on the particular value x . The joint density over a Markov graph G
can be represented as

P(X = x) = ∏c∈cl(G) ϕc(xc), (1–1)

where cl(G) is the set of maximal cliques and the positive functions ϕc(·) are called clique
potentials. This factorization implies a graph with the independence properties defined by the
cliques in the product. The result holds for any Markov network G with a positive distribution,
and is known as the Hammersley-Clifford theorem (Hammersley & Clifford, 1971).
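As a concrete illustration of the factorization (1–1), the following sketch (in Python; the potential values are arbitrary numbers chosen for illustration) builds the joint distribution of four binary variables on the chain A − B − C − D from its clique potentials and verifies the global Markov statement A ⊥ C |B by direct enumeration.

```python
import itertools
import numpy as np

# Binary variables on the chain A - B - C - D. Maximal cliques: {A,B}, {B,C},
# {C,D}. The potential values below are arbitrary positive numbers.
phi_AB = np.array([[2.0, 1.0], [1.0, 3.0]])
phi_BC = np.array([[1.0, 2.0], [4.0, 1.0]])
phi_CD = np.array([[3.0, 1.0], [1.0, 2.0]])

# Joint distribution via the clique factorization (1-1).
joint = np.zeros((2, 2, 2, 2))
for a, b, c, d in itertools.product([0, 1], repeat=4):
    joint[a, b, c, d] = phi_AB[a, b] * phi_BC[b, c] * phi_CD[c, d]
joint /= joint.sum()  # normalization absorbed into the potentials

# Global Markov property check: A and C are separated by B, so
# P(a, c | b) must factor as P(a | b) P(c | b) for every b.
p_abc = joint.sum(axis=3)  # marginalize out D
for b in (0, 1):
    p_ac_b = p_abc[:, b, :] / p_abc[:, b, :].sum()
    outer = np.outer(p_ac_b.sum(axis=1), p_ac_b.sum(axis=0))
    assert np.allclose(p_ac_b, outer)  # A is independent of C given B
```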
1.3 Bayesian Network
1.3.1 Introduction to Bayesian Network
A Bayesian network is a model over a directed acyclic graph. Since the directions of the
edges are included in the model, it can represent more types of conditional independences
than the Markov network. That is, Bayesian networks can provide more accurate descriptions
than Markov networks of the relationships among random variables. For example, for any
two variables in a set of random variables, there are four possible combinations of marginal
and conditional independence statements. Table 1-1 gives an example for each of the three
cases that are representable by Markov networks.
Table 1-1. Conditional independences represented by Markov networks.

Conditional independence   | Marginally independent (X ⊥ Y) | Marginally dependent (X ̸⊥ Y)
X ⊥ Y |Z                   | X   Y − Z                      | X − Z − Y
X ̸⊥ Y |Z                  | non-representable              | X − Y − Z
The fourth case, in which X and Y are marginally independent but dependent
conditioned on variable Z , is not representable by a Markov network. However, it can be easily
represented by a Bayesian network using a v -structure X → Z ← Y , which places two
convergent directions on the edges X − Z and Y − Z . In the Bayesian formula, this
situation can be described by

π(X ,Y |Z) = π(Z |X ,Y )π(X )π(Y )/π(Z) ̸= π(X |Z)π(Y |Z),

which quite often holds for real problems. In Bayesian networks, the direction of an edge
represents the "parent of" relationship. For this reason, Bayesian networks have often been
used in causal inference; see, e.g., Spirtes (2010).
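The "explaining away" behavior of a v-structure can be checked numerically. In the sketch below, X and Y are independent fair coins and Z is taken to be their logical OR, a mechanism chosen purely for illustration; X and Y are marginally independent but become dependent once Z is observed.

```python
import itertools
import numpy as np

# v-structure X -> Z <- Y: X, Y are independent Bernoulli(0.5) causes,
# and Z = X OR Y (a deterministic mechanism chosen purely for illustration).
joint = np.zeros((2, 2, 2))  # indexed [x, y, z]
for x, y in itertools.product([0, 1], repeat=2):
    joint[x, y, int(x or y)] = 0.25

# Marginally, X and Y are independent.
p_xy = joint.sum(axis=2)
assert np.allclose(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0)))

# Conditioned on Z = 1 they become dependent ("explaining away"):
# P(x, y | Z = 1) no longer factorizes into P(x | Z = 1) P(y | Z = 1).
p_xy_z1 = joint[:, :, 1] / joint[:, :, 1].sum()
outer = np.outer(p_xy_z1.sum(axis=1), p_xy_z1.sum(axis=0))
print(np.allclose(p_xy_z1, outer))  # False
```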
There are several equivalent definitions of a Bayesian network. A Bayesian network
B = (G,XV) is composed of a directed acyclic graph (DAG) G = (V ,E) and a p-dimensional
random vector X = (X1, ...,Xp) with a probability density that recursively factorizes according
to the DAG,

P(x) = ∏i q(xi |pa(i)),

where pa(i) denotes the parents of Xi , and q(xi |pa(i)) denotes the conditional distribution of Xi
given pa(i). Therefore, if we know both the conditional independence relations among the variables
in X and a set of local probability distributions associated with each variable, we can recover
the joint distribution of X .
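A minimal sketch of this recursive factorization for a hypothetical three-node chain X1 → X2 → X3, with made-up conditional probability tables, is given below; the probability of any configuration is simply the product of the local conditionals.

```python
import itertools

# Hypothetical CPTs for the chain X1 -> X2 -> X3; all values are made up.
q1 = {0: 0.6, 1: 0.4}                                # P(X1)
q2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}      # P(X2 | X1)
q3 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}      # P(X3 | X2)

def joint(x1, x2, x3):
    """Recursive factorization P(x) = q(x1) q(x2|x1) q(x3|x2)."""
    return q1[x1] * q2[x1][x2] * q3[x2][x3]

# The factorized joint is a proper distribution, and any configuration's
# probability is read off as a product of local conditionals.
assert abs(sum(joint(*x) for x in itertools.product([0, 1], repeat=3)) - 1.0) < 1e-12
print(joint(1, 0, 1))  # 0.4 * 0.2 * 0.1 = 0.008
```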
A Bayesian network encodes a set of independencies that exist in the domain. To
guarantee the existence of these independencies in the actual population distribution, one
assumption needs to be satisfied: there exists no common unobserved
variable in the domain that is a parent of two or more observed variables.
If P factorizes recursively according to G, then P factorizes according to the moral graph
Gm and obeys the global Markov property relative to Gm. Gm is constructed by marrying the
parents and deleting the directions. The converse is not true: many independence relations in
P are not captured by Gm. This is because P has extra properties not possessed by a general
distribution that factorizes according to the undirected graph Gm.

Since P factorizes according to Gm, P satisfies the global, local and pairwise Markov
properties with respect to Gm, and the conditional independence relations can be read off from
Gm. Considering the local Markov property, we have Xi ⊥ XV\(ne(i)∪{i})|Xne(i). Since the moral graph is
obtained by marrying parents, ne(i) can be written as pa(i) ∪ ch(i) ∪ {j : ch(j) ∩ ch(i) ̸= ∅},
where ch(i) denotes the children of i in G. This set is called the Markov blanket of i in the DAG G, denoted
by MB(i). Therefore the local Markov property can be written as Xi ⊥ XV\(MB(i)∪{i})|XMB(i), which is
a rewrite of the local Markov property with respect to the moral graph Gm.
For two vertices u, v of G, we say that u is an ancestor of v and v is a descendant of u
if there is a directed path from u to v . Let the set of ancestors of v be denoted by an(v) and the set of
descendants of v by de(v). A set A is called an ancestral set if pa(v) ⊆ A for all v ∈ A; the
smallest ancestral set containing a set A is denoted by An(A).
Similarly, there are directed global, local and pairwise Markov properties on directed graphs. The
directed global Markov property states that XI ⊥ XJ|XU holds whenever I and J are separated
by U in GmAn(I∪J∪U), the moral graph of the smallest ancestral set containing I ∪ J ∪ U. The
directed global Markov property in a directed acyclic graph is the analogue of the global Markov
property in the case of an undirected graph, in the sense that it gives the sharpest possible rule
for reading conditional independence relations off the directed graph.
We say that P satisfies the directed local Markov property if each variable is conditionally
independent of its non-descendants given its parent variables:

Xi ⊥ XV\de(i)|Xpa(i) for all i ∈ V ,
where de(i) is the set of descendants and V \ de(i) is the set of non-descendants of i . In
contrast to the undirected case, the directed local Markov property and the directed global
Markov property are equivalent.
Lastly, P obeys the directed pairwise Markov property if for any pair (i , j) of non-adjacent
vertices with j ∈ V \ de(i),

Xi ⊥ Xj |X(V\de(i))\{j} for all i ∈ V .
The directed local Markov property implies the directed pairwise Markov property. The reverse
is not true in general.
A Bayesian network can also be defined through the directed Markov property: B is a Bayesian
network with respect to G if it satisfies the directed Markov property. Other definitions of
Bayesian networks are based on the Markov blanket and d-separation. B is a Bayesian network
with respect to G if every node is conditionally independent of all other nodes in the network,
given its Markov blanket. This definition can be made more general by d-separation: B is a
Bayesian network with respect to G if, for every triplet of disjoint sets I, J,U ⊂ V, it holds
that XI ⊥ XJ|XU whenever U d-separates I from J. Lauritzen (1996) proves the equivalence
between d-separation and the directed global Markov property: let I, J and U be disjoint
subsets of a DAG G; then U d-separates I from J if and only if U separates I from J in GmAn(I∪J∪U).
The process of learning a Bayesian network involves structure learning and parameter
learning. The first step is to induce the structure of the model, that is, the DAG, while the
second step is to estimate the parameters of the model defined by the structure. In practice,
once the graph structure is selected, the parameter estimation problem reduces to a set
of lower dimensional problems. Estimating parameters can be done using standard techniques
such as maximum likelihood, Bayesian estimation or regularized maximum likelihood. For example,
in the Bayesian estimation method, a prior distribution is assumed over the parameters
of the local probability density functions before the data are used, and conjugacy of this prior
distribution is usually desirable.
Under the conditions listed below, the structure learning algorithms considered will
discover a DAG structure equivalent to the DAG structure of the probability distribution P
(Spirtes et al., 2000):

• The independence relationships have a perfect representation as a DAG. This is also known as the DAG faithfulness assumption.

• The database consists of a set of independent and identically distributed cases.

• The database of cases is infinitely large.

• No hidden (latent) variables are involved.

• The statistical tests have no error.
Two DAGs representing the same set of conditional independence relations are equivalent
in the sense that they capture the same set of probability distributions. That is, two models
M1 and M2 are statistically equivalent if and only if they contain the same set of variables, and
joint samples over them provide no statistical grounds for preferring one over the other.

Any two models M1 and M2 over the same set of variables, whose graphs G1 and G2,
respectively, have the same skeleton and the same v -structures, are equivalent. That is, two
DAGs G1 and G2 are equivalent if they have the same skeleton and the same set of uncovered
colliders (i.e., Xi → Xj ← Xk structures where Xi and Xk are not connected by a link, also
known as v -structures). For instance, the models Xi → Xj → Xk , Xi ← Xj ← Xk and
Xi ← Xj → Xk are equivalent since they have the same skeleton and have no v -structures.
Based on the data alone, we cannot distinguish among them. These models can, however, be
distinguished from Xi → Xj ← Xk .
It is important to note that although Bayesian networks are often used to represent causal
relationships, this need not be the case, because of the existence of equivalence classes.
A causal network is a Bayesian network with an explicit requirement that the relationships
be causal. The additional semantics of the causal networks specify that if a node X is actively
caused to be in a given state x (an action written as do(X = x)), then the probability density
function changes to the one of the network obtained by cutting the links from the parents of X
to X, and setting X to the caused value x. Using these semantics, one can predict the impact
of external interventions from data obtained prior to intervention.
Building the structure of a Bayesian network is a difficult task subject to the
conditions listed above, and provably correct algorithms have an exponential worst-case complexity.
Identifying the exact Bayesian network structure is in general impossible due to the existence
of equivalence classes. There exist some methods dealing with Markov networks which have much
better complexity. One possible solution is to extend some existing methods
developed for Markov networks to learn Bayesian networks. However, none of the existing
methods developed for high-dimensional GGMs, e.g., graphical Lasso, nodewise regression and
ψ-learning, can be trivially extended to Bayesian networks due to fundamental differences in
their structures. In particular, the v -structures need to be handled specially when extending a
Markov network learning algorithm to Bayesian networks.
The existing Bayesian network learning methods can be grouped into three categories:
constraint-based, score-based and hybrid.
1.3.2 Constraint-based Approaches
In constraint-based approaches, constraints typically refer to conditional independence
statements, although non-independence-based constraints may be entailed by the structure in
certain cases where latent variables exist. The conditional independence statements can usually be
read off the graph using the d-separation criterion. Structure learning in this case is then the task of
identifying a DAG structure that best encodes a set of conditional independence relations. The
set of conditional independence relations may, for example, be derived from the data source by
statistical tests. However, based on the data set alone, we can at most hope to identify an
equivalence class of graphs encoding the conditional independence relations of the generating
distribution.

A constraint-based algorithm is based on independence tests I (X ,Y |SXY ), which
indicate whether X is conditionally independent of Y given the subset SXY . In this case, any information
source able to provide such information works.
The most straightforward algorithm proposed is the inductive causation (IC)
algorithm (Pearl & Verma, 1995).

IC Algorithm

1. For each pair of vertices Xi and Xj , search for a set Sij such that Xi and Xj are conditionally independent given Sij . If there is no such Sij , place an undirected edge between these two vertices.

2. For each pair of non-adjacent vertices Xi and Xj with a common neighbor Xk , check if Xk ∈ Sij . If it is, then continue. If it is not, then add arrowheads pointing at Xk (i.e., Xi → Xk ← Xj).

3. Orient as many of the undirected edges as possible subject to two conditions: (i) the orientation should not create a new v-structure; and (ii) the orientation should not create a directed cycle.
However, this algorithm requires a number of conditional independence tests that
increases exponentially in the number of vertices. Even for sparse graphs the algorithm
becomes infeasible as the number of vertices increases. Besides the computational burden,
the determination of higher order conditional independence relations from the sample distribution
is generally less reliable than the determination of lower order independence relations. To address
the intractability issue, the PC algorithm (Kalisch & Buhlmann, 2007) has been proposed. Since
it is enough to find one Sij making Xi and Xj independent in order to remove their connection, the PC
algorithm performs the tests in a particular order. The revised step is as follows:

1. If (i , j) are not adjacent, then i and j are d-separated given either pa(i) or pa(j):

– If all edges between a vertex k and i , and between k and j , have already been removed, then sets S∗ with k ∈ S∗ need not be considered in the search for a separating set Sij .

– Restrict the search for separating sets S such that either S ⊆ Adj(Xi) or S ⊆ Adj(Xj), where Adj(Xv) refers to the set of vertices in the graph that are adjacent to v .

2. For each pair of variables Xi and Xj , test whether Xi ⊥ Xj ; if so, remove their edge. For each pair of variables Xi and Xj which are adjacent in the graph with max{|Adj(Xi)|, |Adj(Xj)|} ≥ 2, test Xi ⊥ Xj |S , where |S | = 1 and S ⊆ Adj(Xi) or S ⊆ Adj(Xj). Continuing in this way, for each pair of variables Xi and Xj which are adjacent with max{|Adj(Xi)|, |Adj(Xj)|} ≥ k + 1, test Xi ⊥ Xj |S , where |S | = k and S ⊆ Adj(Xi) or S ⊆ Adj(Xj). Stop when we reach a k such that for all (Xi ,Xj), max{|Adj(Xi)|, |Adj(Xj)|} < k + 1.
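The sketch below illustrates the flavor of the PC skeleton phase for Gaussian data, under simplifying assumptions: Fisher-z tests of partial correlation stand in for a generic conditional independence test, edge orientation is omitted, and the function names and significance level are our own.

```python
import itertools
import numpy as np
from scipy import stats

def partial_corr(R, i, j, S):
    """Partial correlation of i and j given the set S, obtained by inverting
    the corresponding submatrix of the correlation matrix R."""
    idx = [i, j] + list(S)
    P = np.linalg.inv(R[np.ix_(idx, idx)])
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def pc_skeleton(X, alpha=0.01):
    """Skeleton phase of a PC-style search: for k = 0, 1, 2, ..., test
    Xi _||_ Xj | S with |S| = k and S drawn from the current adjacency
    sets, deleting an edge as soon as one separating set is found."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    adj = {i: set(range(p)) - {i} for i in range(p)}
    k = 0
    while any(len(adj[i]) - 1 >= k for i in range(p)):
        for i, j in itertools.combinations(range(p), 2):
            if j not in adj[i]:
                continue
            for S in itertools.combinations(sorted(adj[i] - {j}), k):
                r = np.clip(partial_corr(R, i, j, S), -0.9999, 0.9999)
                z = 0.5 * np.log((1 + r) / (1 - r))          # Fisher's z
                stat = np.sqrt(n - len(S) - 3) * abs(z)
                pval = 2 * (1 - stats.norm.cdf(stat))
                if pval > alpha:                 # cannot reject independence
                    adj[i].discard(j)
                    adj[j].discard(i)
                    break
        k += 1
    return adj
```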
Other constraint-based algorithms include the grow-shrink (GS) algorithm (Margaritis
& Thrun, 1999) and incremental association (Tsamardinos et al., 2006). All these methods
were originally developed for low dimensional problems (Aliferis et al., 2010). Also, they may
involve conditional tests with the size of the conditioning set close to p, which cannot be
carried out, or are very unreliable, when p is greater than n. It is remarkable that under a sparsity
assumption which bounds the neighborhood size of each node, the PC algorithm has been
shown by Kalisch & Buhlmann (2007) to be consistent and to execute in time polynomial
in p. Therefore, the PC algorithm has been considered in the literature as the state-of-the-art
method for learning high-dimensional Bayesian networks. Recent applications and extensions of
the algorithm can be found in Colombo et al. (2012), Verma & Pearl (1991), Harris & Drton
(2013), McGeachie et al. (2014), Ha et al. (2015), among others.
1.3.3 Score-based Approaches
The score-based algorithms view the problem of learning a Bayesian network as a model
selection problem. They assign each candidate network structure a score function which
measures how well the model fits the observed data. The problem then becomes how to find
the highest-scoring network structure.
The scoring function can be entropy (Herskovits & Cooper, 2013), minimum description
length (Lam & Bacchus, 1994) or Bayesian scores (Heckerman et al., 1995). Under
appropriate conditions, the score-based methods can also be shown to be consistent; see
Chickering (2002) and Preetam et al. (2016) for the low and high dimensional cases, respectively.
Given the graph structure G and complete data D, define

Score(G,D) = P(G|D),
which is the posterior probability of G given the data set. By Bayes' law,

Score(G,D) = P(G|D) = P(D|G)P(G)/P(D).

We only need to maximize the numerator since the denominator does not depend on G. There
are several ways to calculate P(G). For simplicity, we ignore P(G), which is the same as
assuming a uniform prior on the structures.
We use θ to denote the parameters,

P(D|G) = ∫ P(D|G,θ)P(θ|G)dθ.

In the large sample limit, the term P(D|G,θ)P(θ|G) can be reasonably approximated as
a multivariate Gaussian. Given the maximum likelihood estimate θ̂ and ignoring terms
that do not depend on the data set size N, the BIC score approximation can be written as

BICscore(G,D) = logP(D|θ̂,G) − (d/2) log(N),

where d is the number of free parameters. The usefulness of the BIC score comes from
the fact that it does not depend on the prior over the parameters, which makes it popular in
practice where prior information is not available or is difficult to obtain.
Score-based algorithms attempt to optimize the score, returning the structure G that
maximizes it. This poses many problems since the space of all possible structures is at
least exponential in the number of variables p: there are p(p − 1)/2 possible undirected edges
and 2^{p(p−1)/2} possible structures for every subset of these edges; what is more, the direction
of each edge is undetermined. Therefore it is not possible to calculate the score for every
possible Bayesian network structure, and instead heuristic search algorithms are employed in
practice.
A simple greedy search type of algorithm is the hill-climbing (HC) algorithm.

Hill-Climbing Algorithm

1. Start with an initial graph structure G, e.g. the empty structure.

2. Repeat as long as Score(G) increases:

– apply an operation that yields an acyclic graph G∗: add, delete or reverse an arc of G;

– compute the score of the new graph, Score(G∗);

– if Score(G∗) > Score(G), set G = G∗ and Score(G) = Score(G∗).
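A minimal sketch of this greedy search is given below, assuming the data is a NumPy array; it reuses the bic_score sketch above and checks acyclicity with Kahn's algorithm. It accepts any improving single-arc move rather than the best one, which is one of several common variants.

```python
import itertools

def is_acyclic(parents, p):
    """Kahn's algorithm: repeatedly remove source nodes; the graph is a DAG
    iff every node gets removed."""
    indeg = {i: len(parents[i]) for i in range(p)}
    children = {i: [j for j in range(p) if i in parents[j]] for i in range(p)}
    stack = [i for i in range(p) if indeg[i] == 0]
    seen = 0
    while stack:
        u = stack.pop()
        seen += 1
        for v in children[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return seen == p

def hill_climb(data, score=bic_score):
    """Greedy hill climbing over DAGs: starting from the empty graph, apply
    single-arc additions, deletions and reversals as long as the score rises."""
    N, p = data.shape
    parents = {i: [] for i in range(p)}
    best = score(data, parents)
    improved = True
    while improved:
        improved = False
        for i, j in itertools.permutations(range(p), 2):
            for op in ("add", "delete", "reverse"):
                cand = {k: list(v) for k, v in parents.items()}
                if op == "add" and i not in cand[j]:
                    cand[j].append(i)            # add arc i -> j
                elif op == "delete" and i in cand[j]:
                    cand[j].remove(i)            # delete arc i -> j
                elif op == "reverse" and i in cand[j]:
                    cand[j].remove(i)            # reverse arc i -> j
                    cand[i].append(j)
                else:
                    continue
                if not is_acyclic(cand, p):
                    continue
                s = score(data, cand)
                if s > best:
                    best, parents, improved = s, cand, True
    return parents, best
```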
Another approach to finding the best score is simulated annealing, which considers
operators randomly. If an uphill step is induced, the system moves to the new state. If
a downhill step is induced, the system moves to the new state with a probability that
decreases with the reduction in score. The temperature is slowly lowered, so that the search
eventually concentrates around the global maximum.

Unfortunately, the task of finding a network structure that optimizes the scoring function
is NP-hard (Chickering, 1996), and the search process often stops at a locally optimal structure.
1.3.4 Hybrid Approaches
Hybrid approaches combine the constraint-based and score-based techniques to offset their
respective weaknesses. Both the sparse candidate algorithm (Friedman et al., 2008) and the
max-min hill-climbing (MMHC) algorithm (Tsamardinos et al., 2006) belong to this category.
The idea of these algorithms is that if we directly apply the greedy HC algorithm, the search
space can be huge, so there is a need for methods that increase the chances of building
a good quality model without exploring the whole search space exhaustively. One possible
approach is to use a less computationally expensive method to determine a promising subset of
the search space, on which we can subsequently apply a more systematic and costly method.
The MMHC algorithm combines a constraint-based method, the max-min parents-children (MMPC) algorithm
(Tsamardinos et al., 2003a), and a score-based algorithm, the HC algorithm. The algorithm first
identifies the parents and children set of each variable, then performs a greedy hill-climbing
search in the space of Bayesian networks. The search begins with an empty structure and
then adds, deletes or reverses an edge, whichever leads to an increase of the score. The important
difference between the MMHC algorithm and the standard HC algorithm is that the search is constrained
to consider adding an edge only if it was discovered by MMPC in the first phase.
Not every probability distribution can be faithfully represented by a DAG, and the typical
structure-learning algorithms can only deal with a restricted range of data sets. Faithfulness of
the distribution guarantees the existence of a DAG. Faithfulness along with the Markov property
indicates that there is a one-to-one mapping between the graphical criterion of d-separation
and conditional independence in the data. In practice, existing score-based, constraint-based
and hybrid algorithms deal primarily with discrete data sets. It is known that score-based
algorithms for continuous variables are computationally expensive. The GS algorithm proposed by
Magrassi et al. (2005) adopted a distribution-free test of conditional independence; however,
it is computationally expensive and cannot be readily used with the current constraint-based
algorithms for all but small networks.
Variable selection is a commonly used method to reduce the number of variables for
building more robust models. The central premise when using a variable selection technique
is that the data contain many variables that are either redundant or irrelevant, and can
thus be removed without incurring much loss of information. Variable selection and causal
structure learning share one concept: the Markov blanket of a variable X is the smallest set
which contains all variables carrying information about X that cannot be obtained from other
variables. The Markov blanket in a causal graph includes the set of parents, children and
spouses. In variable selection, we call variables carrying information about the target that cannot
be obtained from other variables strongly relevant variables. The variable selection process
and the causal graph construction process are thus similar in that both amount to a Markov blanket
identification process. It has been shown that the Markov blanket of a variable X is exactly the set
of strongly relevant variables, and that it is unique for faithful distributions (Tsamardinos et al.,
2003a).
1.4 ψ-learning Algorithm for Learning Gaussian Graphical Models
This section provides a brief review of the ψ-learning algorithm (Liang et al., 2015) for
learning Gaussian graphical models.

During the past decade, the Gaussian graphical model (GGM), as a special case of Markov
networks, has been widely studied. The idea of learning a Gaussian graphical model is to use
the partial correlation coefficient: a zero partial correlation coefficient indicates conditional
independence of the two variables. There also exists another way to measure dependency,
based on the correlation coefficient; however, the latter is less powerful due to
the fact that all variables in a system are more or less correlated. A variety of methods have
been proposed for constructing Gaussian graphical models from observed data. A popular
method is covariance selection (Dempster, 1972), which identifies the nonzero elements in
the concentration matrix (i.e., the inverse of the covariance matrix), because the nonzero entries in the
concentration matrix correspond to the conditionally dependent pairs of variables. Furthermore, Lauritzen
(1996) showed that the partial correlation coefficient between X (i) and X (j) given all other
variables can be expressed as

ρij |V\{i,j} = −Ci ,j /√(Ci ,iCj ,j), i , j = 1, ..., p, (1–2)

where Ci ,j denotes the (i , j)-entry of the concentration matrix, and V = {1, 2, ..., p} denotes the
set of indices of all variables of the system. However, this approach cannot be applied in the case
of p > n, where the sample covariance matrix is singular and thus the concentration matrix
can no longer be directly estimated. To tackle this difficulty, regularization methods such
as nodewise regression (Meinshausen & Buhlmann, 2006) and graphical Lasso (Yuan & Lin,
2007; Friedman et al., 2008; Danaher et al., 2014) have been proposed. Nodewise regression
uses Lasso (Tibshirani, 1996) as a variable selection method to identify the neighborhood
of each variable, which corresponds to the nonzero elements of the concentration matrix.
A neighborhood is the set of predictor variables with nonzero coefficients in a regression
model estimated separately for each variable. Meinshausen & Buhlmann (2006) showed that
this method asymptotically recovers the true graph. To avoid estimating a large number of
regressions, Yuan & Lin (2007) proposed to directly estimate the concentration matrix using
a regularization method with an l1-penalty. The method was then accelerated by Friedman
et al. (2008) using a coordinate descent algorithm that was originally designed for Lasso
regression, and this led to the so-called graphical Lasso algorithm.
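Assuming n > p so that the sample covariance matrix is invertible, equation (1–2) can be turned into code directly; the sketch below computes the full matrix of partial correlation coefficients from the estimated concentration matrix, with a small synthetic chain example for illustration.

```python
import numpy as np

def partial_correlations(X):
    """Full-order partial correlations via (1-2): invert the sample covariance
    to get the concentration matrix C, then rho_{ij|rest} = -C_ij / sqrt(C_ii C_jj).
    Requires n > p so that the sample covariance matrix is invertible."""
    C = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(C))
    rho = -C / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

# Example: for data generated from the chain X1 -> X2 -> X3, the partial
# correlation of X1 and X3 given X2 should be near zero.
rng = np.random.default_rng(1)
x1 = rng.normal(size=5000)
x2 = x1 + rng.normal(size=5000)
x3 = x2 + rng.normal(size=5000)
print(partial_correlations(np.column_stack([x1, x2, x3]))[0, 2])
```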
Another popular method to learn Gaussian graphical models is based on limited order
partial correlations. An important algorithm belonging to this category is the PC algorithm
(Spirtes et al., 2000), which works iteratively: it starts with a full graph with edges between
all variables, and then, for each edge of the current graph, it searches for a subset Q such that
the two variables connected by the edge are conditionally independent given Q. If such a set Q
is found, then the corresponding edge is removed. Since the PC algorithm searches for the maximum
of a set of p-values, it can be very slow when p is large. Quite recently, Liang et al. (2015) proposed the
ψ-learning method, which works on an equivalent measure of partial correlation coefficients
calculated with reduced conditioning sets. Let ψij denote the equivalent measure of the partial
correlation coefficient ρij |V\{i,j}. They are equivalent in the sense that

ψij = 0 ⇐⇒ ρij |V\{i,j} = 0, (1–3)
provided that the GGM satisfies the Markov property and adjacency faithfulness condition.
The GGM can be represented by an undirected graph G = (V,E), where V, with a slight
abuse of notation, denotes the set of p vertices corresponding to the p variables X (1), ...,X (p),
and E = (eij) denotes the adjacency matrix. If two vertices i , j ∈ V form an edge, we say
that i and j are adjacent and set eij = 1. The boundary set of a vertex v ∈ V, denoted by bG(v),
is the set of vertices adjacent to v , that is, bG(v) = {j : evj = 1}. The boundary set is also
called the neighborhood. A path of length l > 0 from v0 to vl is a sequence v0, v1, ..., vl of distinct
vertices such that evk−1,vk = 1 for all k = 1, ..., l . The subset U ⊂ V is said to separate I ⊂ V
from J ⊂ V if for every i ∈ I and j ∈ J, all paths from i to j have at least one vertex in U. For
a pair of vertices i ̸= j with eij = 0, a set U ⊂ V is called an {i , j}-separator if it separates
i and j in G. Let Gij be a reduced graph of G with eij set to zero. Then both the
boundary sets bGij (i) and bGij (j) are {i , j}-separators in Gij .
Let XV denote a random vector indexed by V = {1, ..., p} with probability distribution
PV. Let A ⊂ V be a subset of V, and let PA be the marginal distribution associated with the
random vector indexed by A. For a triplet I, J,U ⊂ V, we use XI ⊥ XJ|XU to denote that XI is
conditionally independent of XJ given XU.
Let rij denote the correlation coefficient of variables X (i) and X (j). Let G = (V, E) denote
the correlation graph of X (1), ...,X (p), where E = (eij) is the adjacency matrix with eij = 1 if
|rij | > 0 and 0 otherwise. Let r̂ij denote the empirical correlation coefficient of X (i) and X (j),
let ri denote a threshold value for node i , and let Eri ,i = {v : |r̂iv | > ri} \ {i} denote a reduced neighborhood
of node i in the empirical correlation graph. For convenience, we define Erj ,j = {v : |r̂jv | > rj} \ {j},
Eri ,i ,−j = Eri ,i \ {j}, and Erj ,j ,−i = Erj ,j \ {i}. For any pair of vertices i and
j , we define the partial correlation coefficient ψij by

ψij = ρij |Sij , (1–4)

where Sij = Eri ,i ,−j if |Eri ,i ,−j | < |Erj ,j ,−i | and Sij = Erj ,j ,−i otherwise, and |D| denotes the
cardinality of the set D. To distinguish ψij from conventional partial correlation coefficients, we
call it the ψ-partial correlation.
Definition 1. (Markov property) We say that PV satisfies the Markov property with respect to
G if for every triplet of disjoint sets I, J,U ⊂ V, it holds that XI ⊥ XJ|XU whenever U separates I
and J in G.

Definition 2. (Adjacency faithfulness) We say that PV satisfies the adjacency faithfulness
condition with respect to G if, whenever two variables X (i) and X (j) are adjacent in G, they are
dependent conditioned on any subset of XV\{i ,j}.

The adjacency faithfulness condition implies that if there exists a subset U ⊆ V \ {i , j}
such that X (i) ⊥ X (j)|XU, then X (i) and X (j) are not adjacent in G.
Figure 1-2. Illustrative plot for the calculation of ψ-partial correlation coefficients, where the solid and dotted edges indicate the direct and indirect associations, respectively. The left and right shaded ellipses cover, respectively, the reduced neighborhoods of node i and node j in the correlation graph.
Furthermore, by the Markov property, we have

X (i) ⊥ X (j)|XU =⇒ X (i) ⊥ X (j)|XV\{i ,j} for any U ⊆ V \ {i , j}.

In particular, if U = ∅, we have

X (i) and X (j) are marginally independent =⇒ X (i) ⊥ X (j)|XV\{i ,j},

or, equivalently,

corr(X (i),X (j)) = 0 =⇒ ρij |V\{i ,j} = 0,
where corr(X (i),X (j)) denotes the correlation coefficient of X (i) and X (j), and ρij |V\{i ,j} denotes the partial
correlation coefficient of X (i) and X (j) conditioned on all other variables. Since the essence of
the Gaussian graphical model is to find the pairs of random variables for which the partial
correlation coefficient is equal to zero, a correlation screening procedure can be applied to
reduce the size of the conditioning set in calculating the partial correlation coefficient. Let ψij
denote the partial correlation coefficient calculated with the reduced conditioning set Sij , i.e.,
ψij = ρij |Sij . Under the Markov property and faithfulness condition, Liang et al. (2015) showed that
ψij is equivalent to ρij |V\{i ,j} in learning the structure of the Gaussian graphical model, in the
sense that

ψij = 0 ⇐⇒ ρij |V\{i ,j} = 0.

Further, under mild conditions on the sparsity of the underlying GGM, Liang et al. (2015)
showed that the size of Sij can be bounded by n/ log(n). Therefore, the ψ-learning algorithm
successfully reduces the problem of partial correlation coefficient calculation from a
high-dimensional setting to a low-dimensional one. Note that ρij |V\{i ,j} is not even computable
when p is larger than n. In summary, the ψ-learning algorithm consists of the following steps
to calculate the ψ-partial correlation coefficients:
Algorithm 1.1. (ψ-learning algorithm)

(a) (Correlation screening) Determine the reduced neighborhood for each variable Xi :

(i) Conduct a multiple hypothesis test to identify the pairs of variables for which the empirical correlation coefficient is significantly different from zero. This step results in a so-called empirical correlation network.

(ii) For each variable Xi , identify its neighborhood in the empirical correlation network, and reduce the size of the neighborhood to O(n/ log(n)) by removing the variables having lower correlation (in absolute value) with Xi . This step results in a so-called reduced correlation network.

(b) (ψ-calculation) For each pair of variables Xi and Xj , identify the separator Sij based on the reduced correlation network resulting from step (a), and calculate ψij = ρij |Sij , where ρij |Sij denotes the partial correlation coefficient of Xi and Xj calculated for the dataset X conditioned on the variables {Xl : l ∈ Sij}.

(c) (ψ-screening) Conduct a multiple hypothesis test to identify the pairs of vertices for which ψij is significantly different from zero, and set the corresponding elements of E to 1.
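A simplified sketch of steps (a) and (b) is given below. It replaces the multiple hypothesis test of step (a)(i) with a plain ranking of absolute correlations, keeps the top n/(ξn log n) neighbors per node, and computes ψij by inverting the correlation submatrix over {i, j} ∪ Sij; all names are ours.

```python
import numpy as np

def psi_partial_correlations(X, xi=1.0):
    """Sketch of steps (a)-(b) of Algorithm 1.1: screen each node's neighborhood
    down to at most n/(xi*log n) variables by absolute correlation (a plain
    ranking stands in for the multiple hypothesis test of step (a)(i)), then
    compute psi_ij as the partial correlation of i and j given the smaller
    reduced neighborhood S_ij."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    m = max(1, int(n / (xi * np.log(n))))
    # Reduced neighborhoods: the m most correlated variables per node
    # (position 0 of the ranking is the node itself and is skipped).
    nbr = {i: set(np.argsort(-np.abs(R[i]))[1:m + 1]) for i in range(p)}
    psi = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            Si, Sj = nbr[i] - {j}, nbr[j] - {i}
            S = sorted(Si) if len(Si) < len(Sj) else sorted(Sj)
            idx = [i, j] + S
            P = np.linalg.inv(R[np.ix_(idx, idx)])   # concentration submatrix
            psi[i, j] = psi[j, i] = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])
    return psi
```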
The bound on the neighborhood size is suggested to be set as n/[ξn log(n)], where ξn is a
tunable parameter with a default value of 1. For some problems, one may set ξn > 1, say
2 or 3; if n is too small, one may set ξn < 1, say 1/2 or 1/3, while ensuring that the condition
n/[ξn log(n)] < n − 4 holds. The ψ-learning algorithm is very convenient for incorporating
prior knowledge into network construction. For example, if we know that some pair of variables, say
Xi and Xk , are correlated, then we can always include Xk in the set Eri ,i and include Xi in
the set Erk ,k , even if the empirical correlation between Xi and Xk is not strong.
We apply Fisher’s transformation to rij to get
zij =1
2log
[1 + rij1− rij
],
which approximately follows a normal distribution with mean 0 and variance 1/(n − 3)
under the null hypothesis H0 : rij = 0. Based on this asymptotic result, we calculate p-value for
the test H0 : rij = 0↔ H1 : rij = 0, and then apply the probit transformation to the p-value to
get
zij = Φ−1(1− 2[1−Φ(
√n − 3|zij |)]) = Φ−1(2Φ(
√n − 3|zij |)− 1), (1–5)
where Φ(·) denotes the cumulative distribution function of the standard normal distribution.
For convenience, we call zij a correlation score. Due to the monotonicity of the transformation
(1–5) in |zij | it converts a double-sided test H0 : zij = 0 ↔ zij = 0 to a single test
H0 : zij = 0 ↔ H1 : zij > 0. Therefore, it can be used as a test statistic for identification of
non-zero correlation coefficients.
Similarly, we let ψ̂ij denote the empirical value of ψij . Applying Fisher's transformation to
ψ̂ij , we get

z ′ij = (1/2) log[(1 + ψ̂ij)/(1 − ψ̂ij)], (1–6)

which approximately follows a normal distribution with mean 0 and variance 1/(n − |Sij | − 3).
Then, we calculate the p-value for the corresponding test and apply the probit transformation
to the p-value to get

z̃ ′ij = Φ−1(2Φ(√(n − |Sij | − 3)|z ′ij |) − 1), (1–7)

which is called the ψ-score. Similarly, it can be used as a test statistic for the identification of
non-zero ψ-partial correlation coefficients and thus the structure of the Gaussian graphical model.
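The ψ-scores of (1–6)–(1–7) are straightforward to compute; the sketch below applies Fisher's transformation followed by the probit transform of the p-value. The clipping guards against numerical infinities only and is not part of the method as stated.

```python
import numpy as np
from scipy import stats

def psi_score(psi_hat, n, s_size):
    """psi-score via (1-6)-(1-7): Fisher's transformation of the empirical
    psi-partial correlation, then the probit transform of its p-value,
    giving a one-sided test statistic for nonzero coefficients."""
    z = 0.5 * np.log((1 + psi_hat) / (1 - psi_hat))            # eq. (1-6)
    stat = np.sqrt(n - s_size - 3) * np.abs(z)
    q = np.clip(2 * stats.norm.cdf(stat) - 1, 0.0, 1 - 1e-16)  # numerical guard
    return stats.norm.ppf(q)                                   # eq. (1-7)

# Example: a moderate psi-partial correlation with n = 100 and |S_ij| = 5.
print(psi_score(0.3, n=100, s_size=5))
```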
CHAPTER 2
UNDIRECTED GRAPHICAL MODEL FOR COUNT DATA
2.1 RNA-seq Data and Poisson Graphical Models
In recent years, next generation sequencing (NGS) has gradually replaced microarray
as the major platform in transcriptome studies, for example through sequencing RNAs (RNA-seq).
RNA-seq uses counts of reads to quantify gene expression levels. Compared to microarray data,
RNA-seq data have many advantages, such as providing digital rather than analog signals of
expression levels, a wider dynamic range of measurements, less noise, higher throughput,
etc. However, their discreteness also challenges the existing methods. In practice, RNA-seq
data are often modeled using the Poisson (Sultan et al., 2008) or negative-binomial distribution
(Anders & Huber, 2010; Robinson & Oshlack, 2010), but difficulties often arise in
computing or establishing the properties of statistics based on these distributions.
Let Y = (Y1, ...,Yp) denote a p-dimensional Poisson random vector associated with a
graphical model G. It is natural to assume that all the node-conditional distributions, that
is, the conditional distributions of one variable given all other variables, are Poisson, with the
distribution given by

P(Yj |Yk ,∀k ̸= j ;Θj) = exp[θjYj − log(Yj !) + ∑k ̸=j θjkYjYk − A(θj , θjk)], (2–1)

where Θj = {θj , θjk , k ̸= j}, and A(θj , θjk) is the log-partition function of the Poisson
distribution. Following from the Hammersley-Clifford theorem (Besag, 1974), the node-conditional
distributions combine to yield the joint Poisson distribution

P(Y;Θ) = exp[∑pj=1(θjYj − log(Yj !)) + ∑j ̸=k θjkYjYk − ϕ(Θ)], (2–2)

where Θ = (Θ1, ...,Θp) and ϕ(Θ) is the normalizing term ensuring the properness of this
distribution. However, the Poisson graphical model suffers from a major caveat: the interaction
parameters θjk must be nonpositive for all j ̸= k to ensure that ϕ(Θ) is finite and thus that the
distribution P(Y;Θ) is proper (Besag, 1974; Yang et al., 2012). Therefore, the Poisson
graphical model only permits negative conditional dependencies, which is a severe limitation
in practice. As shown in Patil et al. (1968), the negative binomial graphical model also suffers
from the same limitation.
To relax this limitation, Allen & Liu (2013) proposed a local Poisson graphical model
(LPGM), which ignores the joint distribution of the Yj 's and works by finding a local model
for each gene using a regularization method based on the conditional distribution (2–1), and
then defining the network structure as the union of the local models. To account for the
high dispersion of NGS data, where the inter-sample variance is greater than the sample
mean, Gallopin et al. (2013) proposed a hierarchical log-normal Poisson model which assumes
Yij ∼ Poisson(λij) with log(λij) = ∑k ̸=j βjkyik + ϵij for i = 1, ..., n, where ϵij is a Gaussian
random variable and yik denotes the standardized, log-transformed data. For each variable Yj , the
local model can be found via a regularization approach for the log-normal Poisson regression.
Quite a few related models have been proposed along this direction, including the truncated
PGM, quadratic PGM, sub-linear PGM and square-root PGM; refer to Yang et al. (2012) and
Inouye et al. (2016) for the details. However, these LPGM-based methods are not consistent
due to their ignorance of the joint distribution of the Yj 's. Without the joint distribution, the
conditional dependence Yk ⊥ Yj |YV\{k,j} is not well defined, and therefore the theoretical
basis Yk ⊥ Yj |YV\{k,j} ⇐⇒ θkj = 0 and θjk = 0 of the nodewise regression (Meinshausen
& Buhlmann, 2006; Ravikumar et al., 2010) does not hold, where θkj and θjk are defined in
equation (2–1). Hence, linking the Poisson graphical model to nodewise Poisson regression will
not lead to a consistent estimate of the underlying network.
We propose a random effect model-based transformation for RNA-seq data, which
converts the count data to continuous data; the continuous data can be further transformed to
Gaussian data via the semiparametric transformation described in Liu et al. (2009). Then, we
adopt the ψ-learning method developed in Liang et al. (2015) to construct Gaussian graphical
models (GGMs) for the transformed data. Under mild regularity and sparsity conditions, we
show that the proposed method is consistent. Transforming count data to continuous data
greatly facilitates the analysis of NGS data.
The remainder of this chapter is organized as follows. Section 2.2 describes the
random effect model-based transformation and gives a brief review of the semiparametric
transformation of Liu et al. (2009) and the ψ-learning method of Liang et al. (2015). Section 2.3
illustrates the proposed method using simulated data, along with comparisons with gLasso,
nodewise regression, LPGM, and some other existing methods. Section 2.4 presents two real
data examples, and Section 2.5 concludes with a discussion.
2.2 Method
The proposed method consists of three steps: (i) data-continuized transformation, (ii)
data-Gaussianized transformation, and (iii) ψ-learning, which are described in turn below.
2.2.1 Data-Continuized Transformation
To continuize the RNA-seq data, we propose a random effect model-based transformation.
Let Yij denote the RNA-seq expression level of gene i in subject j, where i = 1, ..., p indexes
the genes and j = 1, ..., n indexes the subjects. We assume that
Yij ∼ Poisson(θij), θij ∼ Gamma(αi , βi), (2–3)
where αi and βi are the shape and rate parameters of the Gamma distribution, respectively.
It is easy to see that (2-3) forms a random effect model with the gene-specific random effect
modeled by a Gamma distribution. If we integrate θij out of the joint distribution
f(yij, θij | αi, βi), then Yij follows a negative binomial distribution NB(r, q) with size r = αi and
success probability q = βi/(1 + βi). Hence, the model (2-3) is quite flexible and accommodates
potential overdispersion of the data.
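As a quick numerical check of this mixture representation, one can verify in R that Poisson draws with Gamma-distributed rates match the stated negative binomial marginal; the sketch below assumes the shape-rate parameterization implied by (2-4), and the particular values of alpha and beta are illustrative.

```r
## Sanity check for (2-3): Poisson-Gamma mixture vs. its NB marginal,
## under the shape-rate parameterization (an assumption stated above).
alpha <- 2; beta <- 0.5
y1 <- rpois(1e5, rgamma(1e5, shape = alpha, rate = beta))        # mixture draws
y2 <- rnbinom(1e5, size = alpha, prob = beta / (1 + beta))       # NB marginal
c(mean(y1), mean(y2))   # both should be close to alpha / beta = 4
c(var(y1), var(y2))     # both should be close to alpha * (1 + beta) / beta^2 = 12
```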
To avoid an explicit specification for the values of αi and βi , we conduct a Bayesian
analysis for the model. For this purpose, we let αi and βi be subject to the prior distributions:
αi ∼ Gamma(a1, b1), βi ∼ Gamma(a2, b2).
where a1, b1, a2 and b2 are prior hyperparameters. By the assumption that αi and βi are
a priori independent, the full conditional posterior distributions of αi, βi and θij are given as
follows:

$$
\begin{aligned}
f(\alpha_i \mid \beta_i, \{\theta_{ij}\}_{j=1}^{n}, y_i) &\propto \frac{\alpha_i^{a_1-1}}{\Gamma^n(\alpha_i)}\, e^{\alpha_i\left(-b_1 + n\log\beta_i + \sum_{j=1}^{n}\log\theta_{ij}\right)}, \\
f(\beta_i \mid \alpha_i, \{\theta_{ij}\}_{j=1}^{n}, y_i) &\propto \beta_i^{n\alpha_i + a_2 - 1}\, e^{-\beta_i\left(\sum_{j=1}^{n}\theta_{ij} + b_2\right)} = \mathrm{Gamma}\Big(n\alpha_i + a_2,\ \sum_{j=1}^{n}\theta_{ij} + b_2\Big), \\
f(\theta_{ij} \mid \alpha_i, \beta_i, y_i) &\propto \theta_{ij}^{\,y_{ij}+\alpha_i-1}\, e^{-\theta_{ij}(1+\beta_i)},
\end{aligned} \tag{2-4}
$$

where yi = {yij : j = 1, 2, ..., n}. Regarding the choice of prior hyperparameters, we establish
the following lemma, whose proof is given in the Appendix.
Lemma 1. If a1 and a2 take small positive values, then for all i and j, the posterior mean of
θij, denoted by E[θij | yi], will converge to yij as b1 → ∞ and b2 → ∞.
Suppose that an MCMC algorithm, for example the Metropolis-within-Gibbs sampler (Muller,
1992), is used to simulate from the posterior distribution (2-4). Let θij^(t) denote the posterior
sample of θij at iteration t = 1, 2, ..., and let θ̄ij^(T) = Σ_{t=1}^T θij^(t)/T denote the Monte Carlo estimator
of E[θij | yi]. Then, following from the standard theory of MCMC, we have θ̄ij^(T) →p E[θij | yi] as
T → ∞, where →p denotes convergence in probability. To ensure that the convergence θ̄ij^(T) →p yij
holds in a rigorous manner, the iteration number T and the prior hyperparameters b1 and
b2 need to go to infinity simultaneously. To achieve this goal, we let b1^(t) and b2^(t) denote the
respective values of b1 and b2 at iteration t, and we set

$$b_1^{(t)} = b_1^{(t-1)} + \frac{c}{t^{\zeta}}, \qquad b_2^{(t)} = b_2^{(t-1)} + \frac{c}{t^{\zeta}}, \qquad t = 1, 2, \dots, \tag{2-5}$$

where b1^(0) and b2^(0) are fixed large constants, c > 0 is a small constant, and 0 < ζ ≤ 1. Under
this setting, the MCMC sampler for (2-4) forms an adaptive Markov chain whose target
distribution gradually shrinks toward a Dirac delta measure defined on (αi, βi, θij) = (0, 0, yij).
For simplicity in the theoretical development (see Appendix A), we assume that a random walk
proposal is used in simulating from the conditional posterior distribution f(αi | ·); that is, the
proposal distribution q(αi′ | αi^(t)) = q(|αi′ − αi^(t)|) depends on |αi′ − αi^(t)| only. In summary, we
have the following lemma, whose proof is given in the Appendix.

Lemma 2. If a random walk proposal is used in simulating from f(αi | ·) and the prior
hyperparameters are chosen as in (2-5), then θ̄ij^(T) →p yij for all i and j as T → ∞, where
θ̄ij^(T) = Σ_{t=1}^T θij^(t)/T and θij^(t) denotes the posterior sample of θij generated at iteration t.
Lemma 2 implies that statistical inference for the yij's can be approximately made using
the θ̄ij^(T)'s as T → ∞. The validity of the approximation can be argued as follows. Let
F̂_{y1,...,yp}(t) denote the empirical CDF of (Y1, ..., Yp). It is easy to see that the convergence
θ̄ij^(T) →p yij implies that sup_{t∈R^p} |F̂_{θ̄1^(T),...,θ̄p^(T)}(t) − F̂_{y1,...,yp}(t)| →p 0 as T → ∞. Further, as the sample
size n → ∞, sup_{t∈R^p} |F̂_{y1,...,yp}(t) − F_{Y1,...,Yp}(t)| →a.s. 0 holds under some regularity and
sparsity conditions, where F_{Y1,...,Yp}(t) denotes the joint CDF of (Y1, ..., Yp) and →a.s. denotes
almost sure convergence. For example, we can assume that for each Yi, the number of variables
that Yi depends on is upper bounded by n/log n. In summary, we have
sup_{t∈R^p} |F̂_{θ̄1^(T),...,θ̄p^(T)}(t) − F_{Y1,...,Yp}(t)| →p 0 as T → ∞, which implies that a consistent estimate,
based on the continuized data, can be formed for each conditional probability used in inference of the
network structure underlying Y1, ..., Yp. That is, the conditional independence relations among
Y1, ..., Yp can be learned from the continuized data θ̄1^(T), ..., θ̄p^(T) in a consistent manner.
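To make the transformation concrete, the following R code gives a minimal sketch of the adaptive Metropolis-within-Gibbs sampler for a single gene under (2-4) and (2-5); the function name `continuize` and the random walk step size `sd.rw` are illustrative choices, not part of the original development.

```r
## A minimal sketch of the data-continuized transformation for one gene,
## assuming the shape-rate parameterization of (2-3)-(2-5).
continuize <- function(y, T = 10000, burn = 1000,
                       a1 = 1, a2 = 1, b1 = 1e4, b2 = 1e4,
                       c = 1, zeta = 1, sd.rw = 0.1) {
  n <- length(y)
  alpha <- 1; beta <- 1
  theta.sum <- rep(0, n)
  for (t in 1:T) {
    b1 <- b1 + c / t^zeta                 # adaptive prior schedule (2-5)
    b2 <- b2 + c / t^zeta
    ## Gibbs update: theta_ij ~ Gamma(y_ij + alpha, 1 + beta), from (2-4)
    theta <- rgamma(n, shape = y + alpha, rate = 1 + beta)
    ## Gibbs update: beta_i ~ Gamma(n * alpha + a2, sum(theta) + b2)
    beta <- rgamma(1, shape = n * alpha + a2, rate = sum(theta) + b2)
    ## Random-walk Metropolis update for alpha_i, targeting f(alpha_i | .)
    log.post <- function(a)
      (a1 - 1) * log(a) - n * lgamma(a) +
        a * (-b1 + n * log(beta) + sum(log(theta)))
    prop <- alpha + rnorm(1, 0, sd.rw)
    if (prop > 0 && log(runif(1)) < log.post(prop) - log.post(alpha))
      alpha <- prop
    if (t > burn) theta.sum <- theta.sum + theta
  }
  theta.sum / (T - burn)                  # posterior-mean estimate of theta_ij
}
```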
2.2.2 Data Gaussianized Transformation
Since GGMs have been extensively studied, we seek a transformation that transforms
the continuized data to Gaussian data while maintaining the conditional independence relations
among the variables. The semiparametric Gaussian copula transformation, the so-called
nonparanormal transformation, proposed by Liu et al. (2009), satisfies this requirement. It can
be described as follows.
Let X = (X1, ..., Xp)^T be a continuous p-dimensional random vector. X is said to have a
nonparanormal distribution if there exist functions {fj}_{j=1}^p such that Z = f(X) ∼ N(μ, Σ),
where f(X) = (f1(X1), ..., fp(Xp))^T. We write X ∼ NPN(μ, Σ, f). It is known that
if the fj's are monotone and differentiable, the joint probability density function of X is given by

$$p_X(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\Big\{-\frac{1}{2}\big(f(x)-\mu\big)^T \Sigma^{-1} \big(f(x)-\mu\big)\Big\} \prod_{j=1}^{p} |f_j'(x_j)|. \tag{2-6}$$
Based on this formula, Liu et al. (2009) argued that if X ∼ NPN(μ, Σ, f) and each fj is
monotone and differentiable, then Xi ⊥ Xj | X_{V\{i,j}} ⟺ Zi ⊥ Zj | Z_{V\{i,j}}. By a similar
argument, we have that for any triplet of disjoint sets A, B, C ⊆ V, XA ⊥ XB | XC ⟺
ZA ⊥ ZB | ZC. In other words, the nonparanormal transformation preserves the conditional
independence structure of the original graphical model formed by X. Liu et al. (2009) further
showed that fj(x) = μj + σj Φ^{-1}(Fj(x)) is such a monotone and differentiable transformation,
where μj is the mean of Xj, σj² is the variance of Xj, and Fj(x) is the CDF of Xj. For the
high-dimensional case, where p is greater than n and can increase with n, Fj(x) can be replaced by
a truncated, or Winsorized, estimator of the marginal empirical distribution of Xj in order to
reduce the variance of the estimate.
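The following R code is a minimal sketch of this transformation with μj = 0 and σj = 1; the truncation level δn = 1/(4 n^{1/4} √(π log n)) is the level commonly used for the Winsorized estimator and is an assumption here, as is the function name `npn.transform`.

```r
## A minimal sketch of the nonparanormal transformation of Liu et al. (2009)
## with a truncated (Winsorized) empirical CDF.
npn.transform <- function(X) {             # X: n x p data matrix
  n <- nrow(X)
  delta <- 1 / (4 * n^0.25 * sqrt(pi * log(n)))   # assumed truncation level
  apply(X, 2, function(x) {
    F.hat <- ecdf(x)(x) * n / (n + 1)      # rescale to keep F.hat < 1
    F.hat <- pmin(pmax(F.hat, delta), 1 - delta)  # Winsorize the tails
    qnorm(F.hat)                           # f_j(x) = Phi^{-1}(F_j(x))
  })
}
```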
As shown in Liang et al. (2015), the ψ-learning method is consistent; that is, the network
produced by it converges to the true one as the sample size n → ∞.

The multiple hypothesis tests involved in the correlation screening and ψ-screening steps
can be done using an empirical Bayes method developed in Liang & Zhang (2008). The
advantage of this method is that it allows for general dependence between test statistics.
Other multiple hypothesis testing methods that account for the dependence between test
statistics, for example Benjamini et al. (2006), can also be applied here. The performance of
the multiple hypothesis tests depends on their significance levels. Following the suggestions of
Liang et al. (2015), we set the significance level of correlation screening to α1 = 0.2 and that
of ψ-screening to α2 = 0.05. In general, a high significance level for correlation screening leads
to a slightly larger separator set Sij, which reduces the risk of missing important variables
in the conditioning set; including a few false variables in the conditioning set does not hurt
the accuracy of the ψ-partial correlation coefficients much.
2.2.3 Consistency
In summary, the proposed method consists of three steps: (i) data-continuized transformation,
(ii) data-Gaussianized transformation, and (iii) ψ-learning for GGMs. From Lemma 2 and
the arguments following it, we can conclude that the network structure of Y1, ..., Yp can be
consistently learned from the continuized data θ̄1^(T), ..., θ̄p^(T). Liu et al. (2009) showed that
the data-Gaussianized transformation preserves the network structure underlying the data,
and Liang et al. (2015) showed that the ψ-learning method is consistent in recovering the
underlying network structure. Therefore, the consistency holds for the proposed method;
that is, the true gene regulatory relations can be recovered from the RNA-seq data using the
proposed method when the sample size becomes large.
2.3 Simulation Studies
To illustrate the performance of the proposed method, we consider some simulation
examples with known conditional independence structures. Since most NGS data tend
to be zero-inflated and highly over-dispersed, the data were simulated from a multivariate
zero-inflated negative binomial (ZINB) distribution. The ZINB distribution contains three
parameters, λ, κ and ω, which control its mean, dispersion and degree of zero-inflation,
respectively. The algorithm developed by Yahav & Shmueli (2012) was adopted to simulate the
data; it works via an inverse nonparanormal transformation as follows:
(a) Simulate a random sample of n multivariate Gaussian random vectors with a known
concentration matrix. Denote the random sample by (X1, ..., Xp), where each variable
Xi = (Xi1, ..., Xin)^T consists of n realizations.

(b) For each variable Xi, find the empirical CDF based on the n realizations and calculate the
cumulative probability value for each realization Xij.

(c) Generate a random sample of n zero-inflated negative binomial random variables with
pre-specified parameters λ, κ and ω by inverting the cumulative probability values
obtained in (b).
In our simulations, we set the concentration matrix as follows:

$$C_{ij} = \begin{cases} 0.5, & \text{if } |j-i| = 1,\ i = 2, \dots, (p-1), \\ 0.25, & \text{if } |j-i| = 2,\ i = 3, \dots, (p-2), \\ 1, & \text{if } j = i,\ i = 1, \dots, p, \\ 0, & \text{otherwise}. \end{cases} \tag{2-7}$$
This matrix has been used by quite a few authors to demonstrate their GGM algorithms,
e.g., Yuan & Lin (2007), Mazumder & Hastie (2012), and Liang et al. (2015). To make the
simulation realistic, we set the parameters λ, κ and ω of the ZINB distribution
to their estimates from a real dataset, the acute myeloid leukemia (AML) mRNA sequencing
data, which is available on The Cancer Genome Atlas (TCGA) data portal. We estimated
these parameters for each gene using the function glm.nb in R, and then set the simulation
parameters to the medians of the estimates: λ = 515,743, κ = 3.304 and ω = 0.003. For
the other parameters, we set n = 100 and p = 200. We then applied the proposed method
to the simulated data, which went through the steps of data-continuized transformation,
nonparanormal transformation, and ψ-learning. To measure the performance of the method,
we plot the precision-recall curve in Figure 2-2, which is drawn by fixing the significance level
of correlation screening at α1 = 0.2 and varying α2, the significance level of ψ-screening.
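For concreteness, the following R code sketches steps (a)-(c) under the concentration matrix (2-7) and the ZINB parameters stated above. The helper `qzinb` is our own construction from `qnbinom`, with κ interpreted as the negative binomial size parameter; both are assumptions, not part of the original algorithm description.

```r
## A minimal sketch of Yahav and Shmueli's inverse-nonparanormal generator.
library(MASS)                                   # for mvrnorm
qzinb <- function(u, lambda, kappa, omega)      # assumed ZINB quantile helper
  ifelse(u <= omega, 0,
         qnbinom((u - omega) / (1 - omega), mu = lambda, size = kappa))
n <- 100; p <- 200
C <- diag(p)                                    # concentration matrix (2-7)
C[abs(row(C) - col(C)) == 1] <- 0.5
C[abs(row(C) - col(C)) == 2] <- 0.25
X <- mvrnorm(n, mu = rep(0, p), Sigma = solve(C))          # step (a)
U <- apply(X, 2, function(x) ecdf(x)(x) * n / (n + 1))     # step (b)
Y <- apply(U, 2, qzinb, lambda = 515743, kappa = 3.304, omega = 0.003)  # step (c)
```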
To conduct the data-continuized transformation, the Metropolis-within-Gibbs sampler
was run for 10,000 iterations for this dataset, where the first 1,000 iterations were discarded
as burn-in and the remaining iterations were used for inference. The total CPU
time cost by the sampler was 39.0 seconds on a personal computer with a 2.8 GHz Intel Core i7;
on average, it cost less than 0.2 seconds per variable. For this transformation, we set a1 = a2 = 1,
b1^(0) = b2^(0) = 10,000, c = 1, and ζ = 1, the default setting of the prior hyperparameters used
throughout this chapter. The left panel of Figure 2-1 shows the scatter plot of the continuized
data versus the raw counts for one variable, and the right panel shows the Q-Q plot of the
Gaussianized data for the same variable. The scatter plot indicates that the continuized data
and the raw counts are very close to each other. To explore the data-continuized
transformation thoroughly, we report in Table 2-1 the posterior means and standard
deviations of αi, βi and θij, together with the AUC value, that is, the area under the
precision-recall curve, for measuring the performance of the proposed method. The results
indicate again that θij can be very close to yij and that our method is robust to the choice of
(a1, a2, b1^(0), b2^(0)): the data-continuized transformation does not lose much information
about the raw counts.
For comparison, we applied the existing methods, including gLasso, nodewise
regression, the local Poisson graphical model (LPGM), the truncated Poisson graphical model
(TPGM) and the sublinear Poisson graphical model (SPGM), to the simulated data. For
gLasso and nodewise regression, the simulated ZINB data first went through the logarithm
transformation and the nonparanormal transformation, which have been widely used in RNA-seq
data analysis, and then the methods were applied. The gLasso and nodewise regression
methods are implemented in the R package huge (Zhao et al., 2015). In our application,
the stability approach was used to determine their regularization parameters; it selects the
network with the smallest amount of regularization that simultaneously makes the network
sparse and replicable under random sampling. For LPGM, we used the method proposed by
Allen & Liu (2013); for SPGM and TPGM, we used the methods proposed by Yang et al. (2013).
These three methods are implemented in the R package XMRF (Wan et al., 2015). Besides
these existing methods, we also compared the proposed method with the one without the
data-continuization step, that is, ψ-learning with the logarithm and nonparanormal
transformations, which is labeled "Log+NPN+ψ-learning" in Figure 2-2.
Figure 2-1. Left: scatter plot of the continuized data versus raw counts for one variable. Right: Q-Q plot of the Gaussianized data for one continuized variable.

Figure 2-2. Precision-recall curves produced by the proposed method (Cont+NPN+ψ-learning), log-transformation-based ψ-learning (Log+NPN+ψ-learning), log-transformation-based gLasso (Log+NPN+gLasso), log-transformation-based nodewise regression (Log+NPN+nodewise regression), LPGM, SPGM and TPGM for the simulated data with (n, p) = (100, 200).
Table 2-1. The posterior means and standard deviations of αi, βi and θij for one simulated variable, where a1 = a2 = a and b1^(0) = b2^(0) = b^(0).

a      b^(0)   Yij              θij              αi                        βi                        AUC
1      10^4    513.37 (284.47)  513.27 (284.38)  3.01×10^-7 (2.04×10^-6)   6.58×10^-6 (6.64×10^-6)   0.940
1      10^6    513.32 (284.41)  513.32 (284.41)  8.58×10^-7 (5.99×10^-6)   9.46×10^-7 (9.52×10^-7)   0.941
1      10^10   513.37 (284.47)  513.37 (284.47)  7.47×10^-7 (5.41×10^-6)   9.87×10^-11 (9.71×10^-11) 0.943
0.001  10^4    513.44 (284.43)  513.44 (284.43)  6.54×10^-7 (5.03×10^-6)   1.58×10^-8 (5.10×10^-7)   0.941
0.001  10^6    513.51 (284.45)  513.51 (284.45)  3.78×10^-7 (2.15×10^-6)   1.15×10^-9 (2.87×10^-8)   0.941
0.001  10^10   513.37 (284.48)  513.37 (284.48)  5.75×10^-7 (3.56×10^-6)   6.24×10^-14 (1.72×10^-12) 0.942
The comparison indicates that the proposed method significantly outperforms the
other existing methods, although the improvement mainly comes from ψ-learning. The
data-continuized transformation does not lose information in the data, and it provides a
justification for the empirical practice of treating log-transformed NGS data as continuous.
Multiple datasets have been tried, and the results are very similar. Note that LPGM is an
extension of the nodewise regression method (Meinshausen & Buhlmann, 2006) to the
multivariate Poisson setting; both LPGM and nodewise regression are based on the idea of
neighborhood selection. This experiment also shows that the data-continuized and
nonparanormal transformations improve the performance of neighborhood selection methods.
Based on this experiment, we suspect that the graph consistency established in Meinshausen
& Buhlmann (2006) for nodewise normal regression might not hold for LPGM.
We have also considered several common network structures, such as hub, scale-free,
small-world and random. Multivariate Gaussian random variables given these structures
can be generated by functions provided in the huge package. We then applied steps (b) and
(c) of Yahav and Shmueli's algorithm to obtain ZINB samples with the same parameters as used
before, that is, (n, p) = (100, 200), λ = 515,743, κ = 3.304 and ω = 0.003. The results are
summarized in Figure 2-3, which shows that the proposed method significantly outperforms all
other methods for the scale-free, small-world and random structures, and performs similarly to
gLasso and nodewise regression for the hub structure. To have a thorough comparison with the
existing methods, we also considered the scenario of n > p, with the results reported in Figure 2-4.
2.4 Real Data Examples
2.4.1 Liver Cytochrome P450s Subnetwork
Liver cytochrome P450s play critical roles in drug metabolism, toxicology, and metabolic
processes. They form a superfamily of monooxygenases critical for anabolic and catabolic
metabolism in all organisms characterized so far (Nelson et al., 1996; Aguiar et al., 2005;
Plant, 2007). Specifically, P450 enzymes are involved in the metabolism of various endogenous
and exogenous chemicals, including steroids, bile acids, fatty acids, eicosanoids, xenobiotics,
Figure 2-3. Precision-recall curves of each method for different types of structures with (n, p) = (100, 200). Upper left: hub; upper right: scale-free; lower left: small-world; lower right: random.
environmental pollutants, and carcinogens (De Montellano, 2005). Through experimental work,
Yang et al. (2010) determined the human liver transcriptional network structure and uncovered
subnetworks representative of the P450 gene regulatory network, as shown in the left panel of
Figure 2-5. The genes AK097548s, BC019583, ENST00000301162, and NM-173466
were excluded from our study, as they are non-protein-coding genes and their expression
data are not available in the original dataset. Following the proposed method, we first
applied the data-continuized transformation; we then adjusted for some effects that potentially
affect the distribution of the data, including age, gender, and batch of data collection, through
linear regression. Finally, we applied the nonparanormal transformation and the ψ-learning
method to the adjusted data. The right panel of Figure 2-5 shows the resulting subnetwork.
Figure 2-4. Precision-recall curves of each method for different types of structures with (n, p) = (500, 200). Upper left: hub; upper right: scale-free; lower left: small-world; lower right: random.
2.4.2 Acute Myeloid Leukemia mRNA Sequencing Network
This example illustrates the performance of the proposed method in the small-n-large-p
scenario. The dataset is the mRNA sequencing data from AML patients, available on the TCGA
data portal (http://cancergenome.nih.gov/). In this study, we worked directly on the raw count
data, which contain 179 patients and 19,990 genes. In preprocessing the data, we filtered out
low-expression genes: we first excluded the genes with at least one zero count, and then
selected the 500 genes with the largest inter-sample variance, as suggested by Gallopin et al. (2013).
Figure 2-5. Left: P450 gene regulatory subnetwork from Yang et al. (2010), where the known regulators and P450 genes are shown as blue rectangles and red ovals, respectively. Right: the subnetwork produced by the proposed method.
Figure 2-6. GRN produced by the proposed method for the AML RNA-seq data with(n, p) = (179, 500).
The selected genes are more likely linked to the development of AML as their expression levels
are highly variable.
Figure 2-6 shows the GRN produced by the proposed method for the AML RNA-seq
data. Through this network, we can identify some hub genes that are likely related to AML,
where a hub gene refers to a gene with strong connectivity to other genes. Our findings are
quite consistent with existing knowledge. For example, the hub gene MKI67 is a well-known
tumor proliferation marker; the prognostic value of MKI67 protein expression has
been reported for many types of malignant tumors, including brain, breast, and lung cancer,
with only a few exceptions for certain types of tumors (Mizuno et al., 2009). Another example
is the gene KLF6. Humbert et al. (2011) examined the expression patterns of KLFs with a
putative role in myeloid differentiation in a large cohort of primary AML patient samples,
CD34+ progenitor cells and granulocytes from healthy donors. They found that KLF2,
KLF3, KLF5 and KLF6 are expressed at significantly lower levels in AML blasts and CD34+
progenitor cells than in normal granulocytes, and that KLF6 is upregulated by RUNX1-ETO and
participates in RUNX1-ETO gene regulation. This finding provides new insights into the
under-studied mechanism of RUNX1-ETO target gene upregulation and identifies KLF6 as a
potentially important protein for further study in AML development (DeKelver et al., 2013).
The biological functions of other hub genes, such as H3F3B and TMC8, remain to be studied.
For comparison, gLasso, nodewise regression, and LPGM were applied to this
dataset, run in the same way as for the simulated examples: nodewise regression and gLasso
were run using the package huge under its default setting, with the regularization parameter
determined by the stability approach, and LPGM was run using the package XMRF under its
default setting. All of these methods produced much denser networks than the proposed method.
To assess the quality of the networks produced by the different methods, a power law curve (see,
e.g., Kolaczyk (2009), pp. 80-85) was fitted to each of them. A nonnegative random variable X is
said to have a power law distribution if

$$P(X = x) \propto x^{-\nu}, \tag{2-8}$$
for some positive constant ν. The power law states that the majority of vertices are of very low
degree, although some are of much higher degree. A network whose degree distribution follows
the power law is called a scale-free network, and many biological networks, for example gene
expression networks, protein-protein interaction networks, and metabolic networks, have been
verified to be scale-free (Barabasi & Albert, 1999). Figure 2-7 shows the log-log
plots of the degree distributions of the networks generated by the four methods, where the curves
are fitted by the loess function in R. It shows that the network produced by the proposed
method approximately follows the power law, while those produced by gLasso, nodewise
regression, and LPGM do not.
2.5 Discussion
We have proposed a method for learning GRNs from RNA-seq data. The proposed
method is a combination of a random effect model-based data-continuized transformation,
the nonparanormal transformation, and the ψ-learning algorithm. The proposed method
is consistent in the sense that the true gene regulatory network can be recovered from the
RNA-seq data when the sample size becomes large. The major contribution of the proposed
method lies in the data-continuized transformation, which fills the theoretical gap of how to
transform NGS data to continuous data and facilitates the learning of gene regulatory networks.
The proposed data-continuized transformation involves an adaptive Markov chain. We proved
the convergence and the weak law of large numbers for the adaptive Markov chain under
the framework provided by Liang et al. (2016). A strong law of large numbers (SLLN) can
potentially be proved for the algorithm under the framework provided by Fort et al. (2011).
With the SLLN, some stronger theoretical properties might be obtained for the resulting
networks.
In practice, some authors have treated the logarithm of RNA-seq data as continuous,
though this is not rigorous. The proposed method provides a justification for this practice, which is
necessary and important given the popularity of NGS techniques. As discussed in Liang et al.
(2015), the ψ-learning algorithm provides a general framework for how to integrate multiple
sources of data in reconstructing Gaussian graphical networks, where it is proposed to use a
meta-analysis method to combine the ψ-partial correlation coefficients calculated from different
sources of data. Similarly, with the proposed method, we can integrate different types of omics
data, such as RNA-seq and microarray data, to improve inference for gene regulatory networks.
We expect that this method will be widely used in the near future.

Figure 2-7. Log-log plots of the degree distributions of the four networks generated by the proposed method (upper left), gLasso (upper right), nodewise regression (lower left), and LPGM (lower right).
Finally, we note that, as an alternative to the LPGM method, an existing method that can
potentially be used for Poisson graphical modeling is the latent copula Gaussian graphical
modeling method (Hoff, 2007; Dobra et al., 2011). The basic idea of this method is to
introduce Gaussian latent variables in place of the discrete random variables in the Poisson
network inference. Since the method involves imputation of a large number of latent variables,
it is very slow and can only be applied to problems with a small set of genes.
CHAPTER 3
BAYESIAN NETWORKS FOR MIXED DATA
3.1 Introduction
We propose a new method for learning high-dimensional Bayesian networks. The
proposed method belongs to the category of constraint-based methods and can be viewed
as an extension of the ψ-learning method to Bayesian networks, but with special care
for v-structures. The proposed method consists of three stages, namely moral graph
learning, v-structure identification, and derived-direction identification: it first learns
the moral graph of the Bayesian network using the ψ-learning algorithm, then identifies
the v-structures contained in the network based on conditional independence tests, and
finally identifies the derived directions of the non-convergent edges according to logical rules.
The moral graph, which is formally defined in Section 3.2, can be viewed as a Markov network
representation of the Bayesian network. The consistency of the three-stage method is justified
under the small-n-large-p scenario. To illustrate the generality of the three-stage method, it is
applied to a variety of examples with mixed data, i.e., data consisting of both discrete and
continuous variables. The numerical results indicate that the proposed method significantly
outperforms the existing methods, including the PC algorithm. Under the sparsity assumption,
the proposed method has a computational complexity of O(p² 2^(m-1)), while the computational
complexity of the PC algorithm is O(p^(2+m)), where m is the maximum size of the Markov
blanket of a node.
The mixed data here are restricted to those consisting of Gaussian and multinomial/binomial
variables only. In this scenario, the joint distribution of the mixed variables is well defined
(see Lee & Hastie, 2013): the conditional distribution of each continuous variable given the
rest is still Gaussian, and the conditional distribution of each discrete variable given the rest is
still multinomial. Therefore, all conditional independence tests involved in the proposed method
can be conducted under the framework of generalized linear models (GLMs). Extension of the
proposed method to other types of mixed data is discussed in Section 3.6.
3.2 A Brief Review of Bayesian Network Theory
In this section, we give a brief review of the Bayesian network theory required in this
chapter. For a full account of the theory, refer to Nielsen & Jensen (2009) and Scutari &
Denis (2014).
As mentioned in Chapter 1, a Bayesian network can be represented by a directed acyclic
graph (DAG) G = (V,E), where V, with a slight abuse of notation, denotes a set of p nodes
corresponding to the p variables X1, ..., Xp, and E = (eij) denotes the adjacency matrix, or
arc set. The joint distribution of X1, ..., Xp is given by

$$P(X) = \prod_{i=1}^{p} q\big(X_i \mid \mathrm{Pa}(X_i)\big), \tag{3-1}$$

where Pa(Xi) denotes the parent nodes/variables of Xi in the network, and q(· | ·) specifies the
conditional distribution of Xi given its parent nodes. In a Bayesian network, each node Xi is
conditionally independent of its non-descendants (i.e., the nodes that cannot be reached from
Xi along a directed path) given its parents. This is the so-called local Markov property of
Bayesian networks.
The local Markov property implies that the parents are not completely independent of their
children in the Bayesian network: with Bayes' theorem, it is easy to show how information on
a child can change the distribution of a parent. A convergent connection Xi → Xk ← Xj is called
a v-structure if there is no arc connecting Xi and Xj. In addition, Xk is often called a collider
node, and the convergent connection is then called an unshielded collider. The v-structure
enables Bayesian networks to represent a type of relationship that Markov networks cannot,
namely that Xi and Xj are marginally independent while they are dependent conditioned on Xk.

The Markov blanket of a node Xi is the set consisting of the parents of Xi, the children
of Xi, and the spouse nodes that share a child with Xi. The Markov blanket of a node Xi ∈ V
is the minimal subset of V such that Xi is independent of all other nodes conditioned on it.
The Markov blanket is symmetric, i.e., if node Xi is in the Markov blanket of Xj, then Xj is
also in the Markov blanket of Xi.
If the directions of all arcs in a Bayesian network are removed, the resulting undirected
graph is called the skeleton of the Bayesian network. Note that Bayesian networks
with different arc sets can encode the same conditional independence relationships and
represent the same joint distributions. To illustrate this issue, consider the identity

P(Xi) P(Xj | Xi) P(Xk | Xj) = P(Xi | Xj) P(Xj) P(Xk | Xj),

where the left-hand side represents the serial connection Xi → Xj → Xk, and the right-hand
side represents the divergent connection Xi ← Xj → Xk. Two such Bayesian networks are said
to belong to the same equivalence class. Two DAGs defined over the same set of variables are
equivalent if and only if they have the same skeleton and the same v-structures. Hence, in
Bayesian networks, only the directions of the arcs that are part of one or more v-structures
are important.

The moral graph is an undirected graph constructed by (i) connecting the non-adjacent
nodes in each v-structure with an undirected arc, and (ii) ignoring the directions of the other
arcs. This transformation is called moralisation, and it provides a simple way to transform
a Bayesian network into the corresponding Markov network. In the Markov network, all
dependencies are explicitly represented, even those that would be represented implicitly by
v-structures in the Bayesian network. In the moral graph, the neighboring set of each node
forms its Markov blanket.
Finally, we give the definition of faithfulness of graphical models. Let M denote
the dependence structure of the probability distribution of X, i.e., the set of conditional
independence relationships between any triplet A, B, C of subsets of X. The graph G is said to
be faithful, or isomorphic, to M if for all disjoint subsets A, B, C of X we have

A ⊥P B | C ⟺ A ⊥G B | C, (3-2)

where the left-hand side denotes conditional independence in probability, and the right-hand
side denotes separation in the graph (i.e., C is a separator of A and B). For a Markov network,
C is said to be a separator of A and B if for every a ∈ A and b ∈ B, all paths from a to b have
at least one node in C. For a Bayesian network, C is said to be a separator of A and B if along
every path between a node in A and a node in B there is a node v satisfying one of the following
conditions: (i) v has convergent arcs and neither v nor any of its descendants is in C, or (ii)
v is in C and does not have convergent arcs. Faithfulness provides a theoretical basis for
establishing the consistency of constraint-based methods.
3.3 Learning High-Dimensional Bayesian Networks
Based on the theory of Bayesian networks, we propose a three-stage method to learn the
structure of high-dimensional Bayesian networks: (i) learning the moral graph, (ii) identifying
v-structures, and (iii) identifying derived directions. Upon completion, the first two stages
result in a partially directed acyclic graph (PDAG), which falls into the equivalence class of the
final Bayesian network. The third stage identifies the derived directions of the non-convergent
edges based on some logical rules; the direction of an edge is said to be derived when it is a
logical consequence of previously determined directions.
3.3.1 Learning the Moral Graph
Under the assumption of faithfulness, the moral graph can be learned via the conditional
independence tests Xi ⊥P Xj | Sij \ {Xi, Xj} for all ordered pairs (i, j), where Sij denotes
the Markov blanket of Xi or Xj. If the conditional independence holds, then there is no arc
between Xi and Xj; otherwise, Xi and Xj are in each other's Markov blanket.

In the literature, quite a few algorithms have been proposed for learning Markov
blankets, e.g., the grow-shrink Markov blanket algorithm (Margaritis, 2003) and the incremental
association algorithm (Tsamardinos et al., 2003b; Yaramakala & Margaritis, 2005). The
grow-shrink Markov blanket algorithm works like a forward selection procedure: it first
continues to add new variables to the conditioning set (starting with an empty set) until the
conditional independence holds or there are no more variables to add, and then shrinks the
conditioning set by removing the variables outside the blanket. The incremental association
algorithm is an enhancement of the grow-shrink Markov blanket algorithm, which reduces the
number of conditional tests by arranging the order in which variables are added to the
conditioning set.
A fundamental problem with these algorithms is that they often need to perform
conditional tests with the size of the conditioning set close to p. When p is greater than n,
such tests cannot be carried out or are very unreliable. Their computational complexity is
O(p^(2+a)) for some 0 < a ≤ 1, where the factor p^a accounts for the number of conditional
independence tests performed for each of the p² pairs of nodes. In the worst case, where the
graph is fully connected, a is equal to 1 for all of these algorithms.
In what follows, we present a new algorithm for learning moral graphs, which can work
under the scenario n ≪ p, and which has a computational complexity of O(p²) even in the worst
case. Instead of identifying the exact Markov blanket of each node, we propose to identify
a super Markov blanket S̃i for each node Xi such that Si ⊆ S̃i holds, where Si denotes the
Markov blanket of the node Xi. Let ϕij denote the output of the conditional independence test
Xi ⊥P Xj | Si \ {Xi, Xj}, i.e., ϕij = 1 if the conditional independence holds and ϕij = 0 otherwise.
Let ϕ̃ij denote the output of the conditional independence test Xi ⊥P Xj | S̃i \ {Xi, Xj}. Theorem
3.1 shows that, under the faithfulness assumption, ϕij and ϕ̃ij are equivalent in learning moral
graphs.

Theorem 3.1. Assume that faithfulness holds. Let Si denote the Markov blanket of Xi, and let
S̃i denote a superset of Si. Then ϕij and ϕ̃ij are equivalent in learning moral graphs in the sense
that ϕij = 1 ⟺ ϕ̃ij = 1.

Proof. If ϕij = 1, then Si \ {Xi, Xj} forms a separator of Xi and Xj. Since Si ⊆ S̃i, S̃i \ {Xi, Xj}
is also a separator of Xi and Xj. By faithfulness, we have ϕ̃ij = 1.

On the other hand, if ϕ̃ij = 1, then Xi and Xj are conditionally independent and S̃i \ {Xi, Xj}
forms a separator of Xi and Xj. Since S̃i ⊆ V, V \ {Xi, Xj} is also a separator of Xi and Xj,
and the conditional independence Xi ⊥P Xj | V \ {Xi, Xj} holds. By the total conditioning
property (Property 7 in Pellet & Elisseeff (2008)), which shows that Xj ∈ Si if and only if Xi
and Xj are conditionally dependent given V \ {Xi, Xj}, we have Xj ∉ Si. Since Si is the
Markov blanket of Xi, it follows that Xi ⊥P Xj | Si \ {Xi, Xj}. Therefore, ϕij = 1 holds.
By the symmetry of Xi and Xj, Theorem 3.1 also holds if Si is replaced by Sj and S̃i
is replaced by S̃j. Although ϕij and ϕ̃ij are equivalent in learning moral graphs, the size
of the super Markov blanket S̃i should be kept as small as possible in view of the power of
the conditional independence tests: a large S̃i often reduces the power of the conditional
independence test.
Based on Theorem 3.1, we propose the so-called p-screening algorithm for learning moral
graphs, which provides an efficient way to learn the Markov blankets of all nodes simultaneously.
The algorithm consists of the following steps:

Algorithm 3.1. p-screening algorithm

(a) (Screening for parent and child nodes) Find a superset of parents and children for each node Xi:

(i) For each unordered pair of nodes (Xi, Xj), i, j = 1, 2, ..., p, conduct the marginal
independence test Xi ⊥P Xj and obtain the p-value.

(ii) Conduct a multiple hypothesis test to identify the pairs of nodes that are dependent.
Denote the resulting superset by Ai for i = 1, ..., p. If the size of Ai is greater
than n/(c_{n1} log(n)) for a pre-specified constant c_{n1}, reduce it to n/(c_{n1} log(n)) by
removing the variables having large p-values in the marginal independence tests.

(b) (Spouse node amendment) For each node Xi, find the spouse nodes that are not
included in Ai, i.e., the set Bi = {Xj : Xj ∉ Ai, ∃ Xk ∈ Ai ∩ Aj} for
i = 1, ..., p, where Xj is a node not connected to Xi but sharing a common neighbor with Xi.
If the size of Bi is greater than n/(c_{n2} log(n)) for a pre-specified constant c_{n2}, reduce
it to n/(c_{n2} log(n)) by removing the variables having large p-values in the spouse tests
Xi ⊥P Xj | Xk.

(c) (Screening for the moral graph) Construct the moral graph based on conditional independence tests:

(i) For each ordered pair of nodes (Xi, Xj), i, j = 1, 2, ..., p, conduct the conditional
independence test Xi ⊥P Xj | Sij \ {i, j}, where Sij = Ai ∪ Bi if |Ai ∪ Bi \ {i, j}| ≤ |Aj ∪ Bj \ {i, j}|,
and Sij = Aj ∪ Bj otherwise.

(ii) Conduct a multiple hypothesis test to identify the pairs of nodes that are
conditionally dependent, and set the adjacency matrix Emb accordingly, where Emb
denotes the adjacency matrix of the moral graph.
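To illustrate the flavor of the algorithm, the following R code sketches simplified versions of steps (a) and (c) for the all-Gaussian case. It uses marginal correlation tests, Benjamini-Hochberg adjustment in place of the empirical Bayes multiple test, a linear-model t-test as the conditional independence test, and conditional tests restricted to the marginally dependent pairs; all of these are simplifying assumptions, as are the function and variable names.

```r
## A minimal sketch of moral-graph screening (steps (a) and (c)), Gaussian case.
moral.screen <- function(X, alpha1 = 0.2, alpha2 = 0.05) {
  p <- ncol(X)
  pair <- t(combn(p, 2))
  ## step (a): marginal screening to build the candidate sets A_i
  pv <- apply(pair, 1, function(e) cor.test(X[, e[1]], X[, e[2]])$p.value)
  dep <- pair[p.adjust(pv, "BH") < alpha1, , drop = FALSE]
  A <- lapply(1:p, function(i)
    unique(c(dep[dep[, 1] == i, 2], dep[dep[, 2] == i, 1])))
  ## (size reduction of A_i and the spouse amendment of step (b) are omitted)
  ## step (c): conditional screening given the smaller candidate blanket
  E <- matrix(0, p, p)
  for (k in seq_len(nrow(dep))) {
    i <- dep[k, 1]; j <- dep[k, 2]
    S <- setdiff(if (length(A[[i]]) <= length(A[[j]])) A[[i]] else A[[j]], c(i, j))
    fit <- summary(lm(X[, i] ~ X[, c(j, S)]))
    if (coef(fit)[2, 4] < alpha2) E[i, j] <- E[j, i] <- 1   # retain the edge
  }
  E
}
```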
As annotated in Algorithm 3.1, step (a) finds a superset of parents and children
for each node. As pointed out in the Appendix, Ai also contains the spouse nodes that are
marginally dependent on Xi. Step (b) finds the spouse nodes that are not included in
the superset Ai, i.e., the nodes that are marginally independent of Xi but dependent on Xi
conditioned on their common child. Then, for each node Xi, we have Si ⊆ Ai ∪ Bi; hence, we
can set S̃i = Ai ∪ Bi. It follows from Theorem 3.1 that this algorithm is valid for learning the
moral graph.
The multiple hypothesis tests involved in the algorithm can be done using the empirical
Bayes method developed in Liang & Zhang (2008). The advantage of this method is that
it allows for general dependence between test statistics. Other multiple hypothesis testing
methods that account for the dependence between test statistics, e.g., Benjamini et al. (2006),
can also be applied here. The performance of the multiple hypothesis tests depends on their
significance levels. Following Theorem 3.1, a slightly larger value of α1 should be used to reduce
the risk that Si ⊄ Ai ∪ Bi. On the other hand, the power of the conditional independence tests
in step (c) is adversely affected by the size of the superset S̃i and thus by the value of α1.
However, we found that this effect is not very sensitive to the size of S̃i; including a few extra
variables in S̃i will not hurt the power of the moral graph screening tests much. To balance
the two considerations, we suggest setting α1 = 0.1 or 0.2. Throughout the examples of this
chapter, we set α1 = 0.2 and α2 = 0.05, unless otherwise stated.
In the algorithm, we have restricted the sizes of Ai and Bi based on the sparsity
assumption for the high-dimensional Bayesian network, given by condition (C) of Section 3.3.4.
By assuming that each conditional distribution q(·) in (3-1) can be represented
by the probability distribution function of a normal linear regression or a multiclass logistic
regression, we are able to bound the size of each set Ai by O(n/log(n)) based on the theory
of sure independence screening (Fan & Lv, 2008; Fan et al., 2010); refer to Appendix
B for the details of the theoretical development. Further, under the sparsity assumption, we
are also able to bound the size of each set Bi by O(n/log(n)). Therefore, the size of each
superset S̃i = Ai ∪ Bi can be bounded by O(n/log(n)). With appropriate choices of c_{n1} and
c_{n2}, we can always have |S̃i| < n for all i = 1, 2, ..., p when n is reasonably large. In
this paper, we set c_{n1} = c_{n2} = 1 for all examples. In practice, when the sample size n is small,
even if the size of Bi is smaller than the pre-specified threshold, we might still conduct spouse
tests to reduce its size further: since the size of S̃i adversely affects the power of the moral
graph screening test, a smaller Bi is always preferred.
Since both the marginal tests in step (a) and the conditional independence tests in step
(c) need to be performed only once for each ordered pair of nodes, and the multiple hypothesis
tests can be done in time linear in the total number of p-values, the computational
complexity of the p-screening algorithm is O(p²), independently of the underlying
structure of the Bayesian network, while in the worst case the computational complexity of the
existing algorithms is O(p³).
3.3.2 Identifying v -structures
Given the moral graph, the v-structures contained in the Bayesian network can be
identified by performing further conditional independence tests around each variable. With
the identified v-structures, the Markov blankets can be resolved by deleting the spouse links
and orienting the arcs in the v-structures. This step can be accomplished using
some existing algorithms, such as the collider set algorithm (Pellet & Elisseeff, 2008) or the
local neighborhood algorithm (Margaritis & Thrun, 1999). In this paper, we adopted the
collider set algorithm and provide a theoretical justification for the consistency of the algorithm
under the small-n-large-p scenario; refer to Theorem B.2 in Appendix B for details.
According to the theory of Bayesian networks, only the arcs in v-structures can be
oriented. Given that the moral graph is correct, only triangles can hide spouse links and
v-structures. Three nodes Xi, Xj and Xk form a triangle if the edges Xi − Xj, Xi − Xk
and Xj − Xk all exist. Let Tri(Xi, Xj) = {Xk ∈ V : (Xi, Xk) ∈ Emb, (Xj, Xk) ∈ Emb},
with Xi, Xj ∈ V and (Xi, Xj) ∈ Emb, denote the set of nodes that form a triangle with
Xi and Xj in the moral graph; Tri(Xi, Xj) is also the intersection of the Markov
blankets of Xi and Xj. Note that two spouses Xi and Xj that are not linked in the true graph
can be separated by some set of nodes. Thus, if we can find a set Sij that makes Xi and
Xj conditionally independent, then the link between them is a spouse link to be removed,
and any node Xk ∈ Tri(Xi, Xj) \ Sij is a collider, i.e., a common child, so that the
triplet (Xi, Xj, Xk) forms a v-structure Xi → Xk ← Xj. Let MB(·) denote the Markov blanket
information for each node Xi ∈ V, and let BD(Xi) denote the boundary of Xi, which is the set of
direct neighbors of Xi in the graph G.
We initialize the PDAG as the moral graph given by MB(·), and D as an empty list of
orientation directives. The collider set algorithm of Pellet & Elisseeff (2008) then proceeds as
follows.

Algorithm 3.2. (Collider set algorithm)

(a) For each edge Xi − Xj that is part of a fully connected triangle:

(i) Set B to the smaller of the two sets BD(Xi) \ Tri(Xi, Xj) \ {Xj} and
BD(Xj) \ Tri(Xi, Xj) \ {Xi}.

(ii) For each S ⊆ Tri(Xi, Xj), set Xk = B ∪ S. If Xi and Xj are conditionally
independent given Xk, then set Sij = Xk and go to step (b). Otherwise, set R to
B ∩ {nodes reachable from W in V \ {i, j} : W ∈ Tri(Xi, Xj) \ S} and B′ to B \ R. For
each S′ ⊆ R, set Xk = B′ ∪ S′ ∪ S. If Xi and Xj are conditionally independent given Xk,
then set Sij = Xk and go to step (b).

(b) If Sij is not empty, mark the link Xi − Xj as a spouse link, and for each
Xk ∈ Tri(Xi, Xj) \ Sij, set D = D ∪ {(Xi → Xk ← Xj)}.

(c) Remove all spouse links from the graph G.

(d) For each orientation directive (Xi → Xk ← Xj) ∈ D, if the edges Xi − Xk and Xj − Xk still
exist in G, then orient them as Xi → Xk ← Xj.
Step (a.ii) is based on two caveats concerning the collider set search. First, there might be
d-connecting paths between Xi and Xj that do not pass through any node of Tri(Xi, Xj);
those paths must be appropriately blocked. Second, the base conditioning set must be
checked so that it does not include any descendants of possible colliders, since no descendant
of a collider can be included in a separator set of a Bayesian network.
The complexity of the whole algorithm, iterating over all triangle links and measured by the
number of conditional independence tests, is O(pm 2^(m-1)), where m is the maximum size of the
Markov blanket of a node: the factor pm represents the total number of pairs of candidate spouse
nodes, and the factor 2^(m-1) represents the maximum number of conditioning subsets that
need to be considered for each pair of candidate spouse nodes. When the network is sparse, the
algorithm performs reasonably fast. The local neighborhood algorithm (Margaritis & Thrun,
1999) has the same computational complexity. Pellet & Elisseeff (2008) pointed out that the
collider set algorithm has two major benefits. One is related to the triangle search: given that
the Markov blanket information is correct, only triangles can hide spouse links and v-structure
information. The other is that, for each connected pair Xi − Xj in a triangle, decisions
about spouse links and edge orientations are made at the same time, which makes the
algorithm faster.
3.3.3 Identifying Derived Directions
Upon completion of the second stage of the proposed method, the skeleton and colliders of the
Bayesian network have been identified; that is, we have a PDAG in the equivalence class of the
Bayesian network. Given the skeleton and colliders, a maximally directed Bayesian network can
be obtained by following four necessary and sufficient rules (see, e.g., Verma & Pearl (1991),
Meek (1995) and Kjaerulff & Madsen (2008)), which ensure that no cycles and no additional
colliders are created in the graph:

(A) Since Xi → Xj − Xk is not a valid v-structure, the edge must be directed as Xj → Xk.

(B) Given the edges Xi → Xj → Xk, directing the remaining edge from Xk to Xi would
produce a directed cycle, so it must be directed as Xi → Xk.

(C) Since directing the edge as Xj → Xi would inevitably produce an additional collider
Xl → Xi ← Xk or a directed cycle, it must be directed as Xi → Xj.

(D) Since directing the edge as Xj → Xi would inevitably produce an additional collider
Xj → Xi ← Xl or a directed cycle, it must be directed as Xi → Xj.
These four rules can be applied repeatedly until no more edges can be directed; by the repeated
application, all edges common to the equivalence class of the Bayesian network can be
identified. The remaining edges may be directed using expert knowledge. Alternatively, an
optimization procedure, such as simulated annealing (Kirkpatrick et al., 1983) or stochastic
approximation annealing (Liang et al., 2014), might be applied to direct the remaining edges
such that a selected score function is optimized.

Figure 3-1. Four necessary and sufficient rules for derived directions, where the indices I, J, K and L are used to represent the nodes Xi, Xj, Xk and Xl, respectively.
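As an illustration, the following R code repeatedly applies rule (A) to a PDAG represented by a directed adjacency matrix `dir` (dir[i, j] = 1 means Xi → Xj) and a symmetric undirected adjacency matrix `und`; the representation and function name are illustrative choices.

```r
## A minimal sketch of rule (A): whenever Xi -> Xj and Xj - Xk with Xi and Xk
## non-adjacent, orient the undirected edge as Xj -> Xk; repeat to a fixed point.
apply.rule.A <- function(dir, und) {
  p <- nrow(dir)
  repeat {
    changed <- FALSE
    for (j in 1:p) for (k in 1:p) {
      if (und[j, k] == 1) {
        for (i in which(dir[, j] == 1)) {       # parents Xi of Xj
          if (dir[i, k] + dir[k, i] + und[i, k] == 0) {  # Xi, Xk non-adjacent
            dir[j, k] <- 1; und[j, k] <- und[k, j] <- 0  # orient Xj -> Xk
            changed <- TRUE; break
          }
        }
      }
    }
    if (!changed) break
  }
  list(dir = dir, und = und)
}
```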
3.3.4 Consistency of the Proposed Method
This subsection establishes the consistency of the proposed method; that is, the proposed
method is able to identify a PDAG in the equivalence class of the true Bayesian network as the
sample size n becomes large. To achieve this goal, we assume that the joint distribution (3-1)
of the Bayesian network can be re-expressed in the form

$$p(x, y \mid \Theta) \propto \exp\Big\{-\frac{1}{2}\sum_{s=1}^{p_c}\sum_{t=1}^{p_c}\theta_{st}\,x_s x_t + \sum_{s=1}^{p_c}\nu_s x_s + \sum_{s=1}^{p_c}\sum_{j=1}^{p_d}\rho_{sj}(y_j)\,x_s + \sum_{j=1}^{p_d}\sum_{r=1}^{p_d}\psi_{rj}(y_r, y_j)\Big\}, \tag{3-3}$$
where xs denotes the sth of the pc continuous variables, and yj denotes the jth of the pd discrete
variables. The joint model is parametrized by Θ = [{θst}, {νs}, {ρsj}, {ψrj}]. As shown in Lee
& Hastie (2013), the conditional distributions of (3-3) are given by Gaussian linear regressions and
multiclass logistic regressions. Therefore, all the conditional independence tests conducted in
the moral graph learning and v-structure identification stages are well defined; they are
equivalent to testing whether the corresponding regression coefficients are zero. To be
specific, the test in step (a) of the p-screening algorithm is equivalent to testing the coefficient of
Xj in the GLM
$$X_i \sim 1 + X_j, \tag{3-4}$$

the test in step (c) of the p-screening algorithm is equivalent to testing the coefficient of Xj in the
GLM

$$X_i \sim 1 + X_j + \sum_{k \in S_{ij}} X_k, \tag{3-5}$$

and the test in the v-structure identification stage is equivalent to testing the coefficient of Xj in
the GLM

$$X_i \sim 1 + X_j + \sum_{k \in D_{ij}} X_k, \tag{3-6}$$

where Dij denotes a subset of BD(Xi) \ {Xj}.
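A minimal R sketch of such a test for (3-5) is given below; it covers only the Gaussian and binary cases (a multiclass response would require a multinomial logistic fit), and it assumes the conditioning variables are supplied as a non-empty data frame with `xj` coded numerically. The function name is illustrative.

```r
## A minimal sketch of the GLM-based conditional independence test (3-5):
## Gaussian response -> linear regression, binary response -> logistic regression;
## the p-value of X_j's coefficient serves as the test's p-value.
ci.test.glm <- function(xi, xj, S) {       # S: data frame of conditioning vars
  dat <- data.frame(xi = xi, xj = xj, S)
  fam <- if (all(xi %in% 0:1)) binomial() else gaussian()
  fit <- glm(xi ~ ., family = fam, data = dat)
  summary(fit)$coefficients["xj", 4]       # p-value for the coefficient of X_j
}
```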
Under the GLM assumption, the consistency of the proposed method can be proved based on
the theory of sure independence screening established in Fan et al. (2010), the theory of the
ψ-learning algorithm established in Liang et al. (2015), and the theory established in Kalisch
& Buhlmann (2007) for the PC algorithm. Parallel to the conditions assumed by the PC
algorithm for the Gaussian case, we assume the following conditions:

(A) (Faithfulness) The Bayesian network is faithful, and its joint distribution can be
expressed as a Gaussian-multinomial distribution of the form (3-3).

(B) (High dimensionality) The dimension pn = O(exp(n^δ)), where 0 ≤ δ < (1 − 2κ)α/(α + 2)
for some positive constants κ < 1/2 and α > 0, and the subscript n of pn indicates the
dependence of the dimension p on the sample size n.

(C) (Sparsity) The maximum size of the Markov blankets, denoted by qn = max_{1≤i≤pn} |Si|,
satisfies qn = O(n^b) for some constant 0 ≤ b < (1 − 2κ)α/(α + 2), where Si denotes the
Markov blanket of node i.

(D) (Identifiability) The regression coefficients satisfy

$$\inf\big\{|\beta_{ij|C}| : \beta_{ij|C} \neq 0,\ i, j = 1, 2, \dots, p_n,\ C \subseteq \{1, 2, \dots, p_n\}\setminus\{i, j\},\ |C| \leq O(n/\log(n))\big\} \geq c_0 n^{-\kappa}. \tag{3-7}$$
Since the moral graph learning stage works based on the theory of sure independence
screening, we follow Fan et al. (2010) to give some conditions for GLMs (see Appendix B for
details) such that the resulting Bayesian network satisfies the sparsity condition (C). Fan
et al. (2010) showed that variable screening can be done based on regression coefficients or on
the p-values of the conditional independence tests, which are equivalent to each other; for this
reason, the identifiability condition (D) is given in terms of regression coefficients. Under these
conditions, we show in the Appendix that the proposed method is consistent, i.e.,
P(Ê(n)_mb = E(n)_mb) → 1 and P(Êv = Ev | Ê(n)_mb = E(n)_mb) → 1 as n → ∞, where E(n)_mb denotes the
adjacency matrix of the moral graph, Ev denotes the set of v-structures, and Ê(n)_mb and Êv
denote the estimators of E(n)_mb and Ev obtained by the proposed method, respectively.
Figure 3-2. A smaller version of the graph structure underlying the simulation study, where the circle nodes represent Gaussian variables, the square nodes represent Bernoulli variables, and the solid, dotted and dashed lines represent three different types of edges.
Table 3-1. Outcomes of a binary decision.

                      Actual positive (P)    Actual negative (N)
Predicted positive    True positive (TP)     False positive (FP)
Predicted negative    False negative (FN)    True negative (TN)
3.4 Simulation Studies
3.4.1 Mixed Data for an Undirected Graph
This example was modified from Lee & Hastie (2015); it consists of two types of
variables, Bernoulli and Gaussian. Figure 3-2 shows a smaller version of the structure of the
underlying graph. The data were simulated under two settings, (n, pc, pd) = (500, 100, 100)
and (100, 100, 100), where n denotes the sample size, pc denotes the number of Gaussian
variables, and pd denotes the number of Bernoulli variables. Under each setting, 10 datasets
were simulated independently.

The proposed three-stage method was first applied to this example. Since the true graph
is undirected, only the skeleton was output; that is, the directions of the edges of the
resulting Bayesian network were ignored. Figure 3-3 was drawn by varying the significance level
α2 while fixing the significance levels α1 = 0.02 and α3 = 0.05.
Lee & Hastie (2015) proposed to recover the underlying graph structure by maximizing a
penalized pseudo-likelihood function. The pseudo-likelihood function is defined as the
product of the conditional likelihood functions as in Besag (1974), and the penalty terms
used there vary, for scalars, vectors and matrices, from the l1 norm to the l2 norm to the
Frobenius norm. For simplicity, we call their method the pseudo-likelihood method. For
comparison, the pseudo-likelihood method was applied to this example; the resulting
precision-recall curves for the two datasets are also shown in Figure 3-3, where the curves
were drawn by varying the regularization parameter from 0 to 8. The comparison indicates
that the three-stage method significantly outperforms the pseudo-likelihood method for this
example.

Figure 3-3. Precision-recall curves produced by the three-stage and pseudo-likelihood methods for two mixed datasets: the left generated under the setting (n, pc, pd) = (500, 100, 100) and the right under the setting (n, pc, pd) = (100, 100, 100).

Table 3-2. Average areas under the precision-recall curves produced by the three-stage and pseudo-likelihood methods. The number in parentheses represents the standard deviation of the areas averaged over 10 datasets.

(n, pc, pd)        Three-stage method     Pseudo-likelihood
(500, 100, 100)    0.9970 (8.22×10^-4)    0.9428 (9.49×10^-7)
(100, 100, 100)    0.7493 (0.012)         0.4725 (1.58×10^-4)
Table 3-2 summarizes the areas under the precision-recall curves produced by the two
methods for the 20 simulated datasets. Both methods work stably across datasets,
as indicated by the small standard deviations. The comparison shows that the three-stage
method significantly outperforms the pseudo-likelihood method, especially under the
small-n-large-p scenario (i.e., n < pc + pd).
3.4.2 Mixed Data for a Directed Graph
This example illustrates the performance of the three-stage method for learning Bayesian
networks with mixed data, along with comparisons with a variety of existing methods.
Following Kalisch & Buhlmann (2007), we simulated the mixed data using the following procedure:
(i) fix an order of the variables; (ii) randomly mark half of the variables as continuous and the
rest as binary; (iii) fill the adjacency matrix E with zeros, and replace the lower triangle (below
the diagonal) of E with independent realizations of Bernoulli random variables generated with
a success probability s; and (iv) generate the data according to the adjacency matrix in a
sequential manner.
For this example, this variable X1, which corresponds to the first node of the Bayesian
network, was generated through a Gaussian random variable Y1 ∼ N(0, 1). We set X1 = Y1
if X1 was set to be continuous, and X1 ∼ Binomial(n, 1/(1 + e−Y1)) otherwise. The other
variables X ′i s, i = 2, 3, ..., p, were then sequentially generated by setting
Yi =
i−1∑k=1
0.5EikXK + ϵi , (3–8)
Xi =
Yi , if Xi is continuous,
Binomial(n, exp(Yi )1+exp(Yi )
), if Xi is binary,
(3–9)
where ϵ1, ..., ϵp are iid standard Gaussian random variables, and Eik denotes the (i, k) entry of E. The success probability s used in step (iii) controls the sparsity of the Bayesian network. In our simulations, we set s = 0.02. Let pc and pd denote the numbers of continuous and discrete variables, respectively. In our simulations, we fixed pc = pd = 50, while varying the sample size n at four levels: n = 100, 500, 1000 and 3000. For each value of n, ten datasets were simulated independently.
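The following R sketch illustrates steps (i)-(iv) and Equations (3–8)-(3–9) for one dataset. It is a sketch under our own reading of the notation, in which each binary observation is drawn as a single Bernoulli trial with success probability exp(Yi)/(1 + exp(Yi)); p is assumed even so that exactly half of the variables are continuous.

    set.seed(1)
    n <- 500; p <- 100; s <- 0.02
    is_cont <- sample(rep(c(TRUE, FALSE), p / 2))    # step (ii): variable types
    E <- matrix(0, p, p)                             # step (iii): sparse lower triangle
    E[lower.tri(E)] <- rbinom(p * (p - 1) / 2, 1, s)
    X <- matrix(0, n, p)                             # step (iv): sequential generation
    Y1 <- rnorm(n)
    X[, 1] <- if (is_cont[1]) Y1 else rbinom(n, 1, plogis(Y1))
    for (i in 2:p) {
      Yi <- X[, 1:(i - 1), drop = FALSE] %*% (0.5 * E[i, 1:(i - 1)]) + rnorm(n)  # (3-8)
      X[, i] <- if (is_cont[i]) Yi else rbinom(n, 1, plogis(Yi))                 # (3-9)
    }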
The three-stage method was first applied to this example with the default setting α1 = 0.2, α2 = 0.05 and α3 = 0.05. Figure 3-5 shows the Bayesian network obtained by the three-stage method for a dataset of size n = 3000, where an edge with two directions means that the direction of the edge is undetermined. Compared with the true Bayesian network (shown in Figure 3-4), it is easy to see that many of the identified edges, including their directions, are correct. For example, in the true Bayesian network node 16 has four parents, 92, 71, 53 and 78, all of which were correctly identified by the three-stage method. Similarly, the local structure around node 52 was correctly recovered, and all the parent nodes 4, 56 and 99 of node 100 were correctly identified.
Table 3-3. Average precision and recall of the directed graphs produced by the three-stage (P-screening), PC, HC, MMHC and GS algorithms. The number in parentheses represents the standard deviation of the value averaged over 10 datasets.

(n, pc, pd)       P-screening        PC                 HC                 MMHC               GS
                  Prec.    Recall    Prec.    Recall    Prec.    Recall    Prec.    Recall    Prec.    Recall
(100, 50, 50)     0.562    0.198     0.276    0.056     0.184    0.268     0.458    0.185     0.243    0.018
                  (0.018)  (0.016)   (0.036)  (0.007)   (0.010)  (0.013)   (0.018)  (0.007)   (0.040)  (0.004)
(500, 50, 50)     0.665    0.737     0.607    0.346     0.499    0.482     0.652    0.389     0.270    0.016
                  (0.011)  (0.009)   (0.010)  (0.013)   (0.019)  (0.019)   (0.024)  (0.016)   (0.035)  (0.005)
(1000, 50, 50)    0.702    0.866     0.619    0.463     0.561    0.533     0.665    0.450     0.450    0.038
                  (0.008)  (0.010)   (0.012)  (0.006)   (0.020)  (0.022)   (0.024)  (0.022)   (0.060)  (0.006)
(3000, 50, 50)    0.740    0.990     0.725    0.627     0.568    0.561     0.689    0.539     0.385    0.049
                  (0.015)  (0.003)   (0.011)  (0.009)   (0.024)  (0.020)   (0.030)  (0.020)   (0.039)  (0.008)
Figure 3-4. The true directed network for a dataset with n = 3000 samples.
Table 3-3 summarizes the precision and recall values for the PDAGs produced by the three-stage method, where the PDAG refers to the network obtained at the second stage, for which only the skeleton and v-structures are identified. For comparison, a variety of existing methods, including PC (Spirtes et al., 2000; Kalisch & Buhlmann, 2007), hill-climbing (HC) (Bouckaert, 2001), max-min hill-climbing (MMHC) (Tsamardinos et al., 2006), and grow-shrink (GS) (Margaritis, 2003), were applied to this example. All these methods have been implemented in the R packages pcalg or bnlearn; a sketch of the corresponding calls is given below. Among these methods, PC and GS belong to the class of constraint-based methods, HC belongs to the class of score-based methods, and MMHC belongs to the class of hybrid methods.
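For reference, the sketch below shows how these methods can be invoked in R. It is a minimal sketch under assumed inputs (a data frame dat with continuous columns stored as numeric and binary columns stored as factors); the exact settings used for Table 3-3 may differ.

    library(bnlearn)   # HC, MMHC and GS
    library(pcalg)     # PC

    bn_hc   <- hc(dat)      # score-based search
    bn_mmhc <- mmhc(dat)    # hybrid search
    bn_gs   <- gs(dat)      # constraint-based search

    # PC via pcalg, here with Gaussian CI tests after coding all columns
    # numerically; this coding of the binary columns is our simplification.
    num  <- data.frame(lapply(dat, as.numeric))
    suff <- list(C = cor(num), n = nrow(num))
    pc_fit <- pc(suffStat = suff, indepTest = gaussCItest,
                 alpha = 0.05, labels = colnames(num))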
The comparison indicated that for this example, the three-stage method is superior to the
existing methods in both precision and recall. In particular, when the sample size is moderately
large, say n = 500 and 1000, the three-stage method produced both good precision and good recall values, whereas the existing methods can be much inferior under this scenario.

Figure 3-5. The estimated directed network for a dataset with n = 3000 samples.
3.5 Real Data Analysis
3.5.1 Lung Cancer Genetic Network
This study aims to learn an interactive genetic network for lung squamous cell carcinoma (LUSC), which incorporates both gene expression information (mRNA-array data) and mutation information. The dataset was downloaded from The Cancer Genome Atlas (TCGA) at http://tcga-data.nci.nih.gov/tcga/. The original mRNA data contains 17814 genes and 154 patients, and the original mutation data contains 14873 genes and 178 patients. We filtered the mRNA data by including only the 1000 genes with the most variable expression values, and filtered the mutation data by including only the 102 genes for which the mutation occurred in at least 15% of
the patients. We then merged the two datasets together. The merged dataset consists of 121 patients, which are common to both the mRNA and mutation data.

Figure 3-6. The Bayesian network produced by the three-stage method with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC (Lung Squamous Cell Carcinoma) patients.
Figure 3-6 shows the Bayesian network produced by the three-stage method with α1 = 0.1, α2 = 0.05 and α3 = 0.05. Since the sample size n = 121 is small relative to the total number of variables p = 1102, a small value of α1 is used. We expect that such a small value of α1 will result in a slightly smaller superset Si for each node. Further, to improve the power of the moral graph screening tests, the spouse tests were performed at a significance level of 0.05. The spouse tests reduced the size of the conditioning sets used in the moral graph screening tests. The three-stage method cost 1.94 CPU hours on a 3.6GHz desktop for
this example, and the resulting Bayesian network consists of 693 edges. As shown in Figure 3-6, the network contains four clusters, which are centered at the genes POU2AF1, KLK10, HOXC13 and NPR3, respectively. It is interesting to point out that all four hub genes are lung cancer related. For example, Zhou et al. (2016) reported that POU2AF1 functions in the human epithelium to regulate expression of host defense genes, and Faner et al. (2016) found that POU2AF1 is a B-cell recruitment and immunoglobulin transcription gene which is correlated with emphysema severity. KLK10 has been shown to be over-expressed in lung cancer; see, e.g., Cantile et al. (2012) and Carvalho et al. (2012). Besides the hub genes, several significant RNA and mutation interactions have been identified as well. For example, CDH12,
which is linked to the cluster of HOXC13, has recently been verified to play a significant role in the progression of lung cancer, and patients without CDH12 mutations have a higher survival rate than those with CDH12 mutations (Zhao et al., 2013). CDH18, which is linked to the cluster of NPR3, is among the newly identified mutated genes of lung cancer (Liu et al., 2012). BIRC6 is a potential prognostic biomarker in patients with non-small cell lung cancer (Gharabaghi, 2016). Additionally, abnormality of the TP53 gene is one of the most significant events in lung cancers and plays an important role in the tumorigenesis of lung epithelial cells (Mogi & Kuwano, 2011).
For comparison, the existing methods, including HC, MMHC, PC and GS, were also applied to this example. On the same computer as used by the three-stage method, MMHC took 0.79 CPU hours and HC took 10.89 CPU hours. GS and PC ran for more than 40 CPU hours without producing results. MMHC is a heuristic algorithm, which employs some heuristic rules to first learn a set of candidate parents for each node and then conducts a hill-climbing greedy search to find a sub-optimal network. Potentially, the method can be pretty fast due to its heuristic nature. The HC method searches for an optimally scored network over a large space of networks and hence can be pretty slow. We note that the three-stage method is currently implemented in R, and its computation time is expected to be much shortened if it were implemented in C or FORTRAN.
Figures 3-7 and 3-8 show the Bayesian networks produced by MMHC and HC, respectively. The former consists of 1295 edges and the latter consists of 11999 edges. The network produced by the HC method is too dense. The network produced by MMHC looks reasonably good, although the consistency of the method cannot be guaranteed. This is consistent with its performance in the simulation studies; see Table 3-3.

Figure 3-7. The Bayesian network produced by MMHC with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC (Lung Squamous Cell Carcinoma) patients.

Figure 3-8. The Bayesian network produced by HC with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC (Lung Squamous Cell Carcinoma) patients.
3.5.2 Glioblastoma Genetic Network with Methylation Adjustment
Covariate effect adjustment is important in learning gene regulatory networks, as the relationship between variables can be affected by external variables. The covariate effects can be easily adjusted under the framework provided by the p-learning method. Let W1, ..., Wq denote the external variables. To adjust for their effects, we can replace the p-values used in the parents-children screening step by the p-values calculated from the GLM

Xi ∼ 1 + Xj + W1 + W2 + ... + Wq, (3–10)

in testing the hypothesis H0 : γ′ij = 0 versus H1 : γ′ij ≠ 0, where γ′ij is the regression coefficient of Xj. Similarly, the p-values used in the moral graph screening step can be replaced by the
p-values calculated from the GLM

Xi ∼ 1 + Xj + ∑_{k∈Sij} Xk + W1 + W2 + ... + Wq, (3–11)

in testing the hypothesis H0 : β′ij = 0 versus H1 : β′ij ≠ 0, where Sij is as defined in Algorithm 3.1 and β′ij is the regression coefficient of Xj.
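As an illustration, the adjusted p-value in (3–10) can be extracted from a standard GLM fit in R. The sketch below uses a hypothetical helper adj_pvalue(), which is ours for illustration; fam would be gaussian() for a continuous Xi and binomial() for a binary Xi.

    # p-value attached to Xj when regressing Xi on Xj and the covariates;
    # 'W' is assumed to be a data frame holding the columns W1, ..., Wq.
    adj_pvalue <- function(xi, xj, W, fam = gaussian()) {
      fit <- glm(xi ~ xj + ., data = W, family = fam)
      summary(fit)$coefficients["xj", 4]   # fourth column holds the p-value
    }

For (3–11), the conditioning variables XSij are simply appended to W as extra columns before calling adj_pvalue().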
The dataset we considered is for glioblastoma (GBM), a highly malignant brain tumor for which no cure is available. The dataset consists of both methylation and gene expression data (mRNA-array data) and was downloaded from TCGA at https://tcga-data.nci.nih.gov/tcga/. We filtered the dataset by including only the 1000 genes with the most variable gene expression values. The number of samples/patients is 281. It is known that gene expression values can be affected by the methylation sites in the promoter region. For this reason, we annotated the methylation features of the 1000 genes according to their positions on the chromosome.
In the literature, a few works have performed integrative analyses of gene expression data and methylation data. For example, Wang et al. (2013) proposed an integrative Bayesian genomic (iBAG) model, where the direct effects of methylation on gene expression were first inferred and then combined with the gene expression data to predict clinical outcomes. For this example, we are interested in finding out how genes regulate each other after methylation effects are adjusted. Therefore, we treat the methylation features as covariates. Since the association between the variables (genes) and covariates (methylation features) is clear, only the covariates associated with Xi and Xj need to be included in (3–10). Similarly, only the covariates associated with Xi, Xj and XSij\{i,j} need to be included in (3–11). Figure 3-9 shows the Bayesian network produced by the three-stage method with α1 = 0.2, α2 = 0.01 and α3 = 0.01. There are 262 edges identified in total. By increasing the values of α2 and α3, denser networks can be obtained. The total CPU time is about 3 hours on a 3.6GHz desktop. Our result is very meaningful: many of the connections indicated by the network have been verified in the literature. For this example, we can identify several hub genes, e.g., UBE1L and SHC3. It is known that UBE1L can cause cancer growth suppression (Feng et al., 2008) and SHC3 affects survival in human high-grade astrocytomas (Magrassi et al., 2005).

Figure 3-9. Directed glioblastoma genetic network learned by the three-stage method with methylation effects having been adjusted.
3.6 Discussion
We have proposed a three-stage method for learning the structure of high-dimensional Bayesian networks. The three-stage method first learns the moral graph of the Bayesian network based on variable screening and multiple hypothesis tests, then resolves the local structure of the Markov blanket for each variable based on conditional independence tests, and finally identifies the derived directions for non-convergent connections based on logical rules. We justified the consistency of the three-stage method under the small-n-large-p scenario. The numerical results indicate that the three-stage method significantly outperforms the existing ones, such as the PC, grow-shrink, hill-climbing, and max-min hill-climbing methods.
The time complexity of the three-stage method is dominated by its first two stages. As analyzed in Section 3.2, the time complexity of the first stage is O(p^2) and that of the second stage is O(p m 2^{m−1}), where m = max_{i∈V} |Si| denotes the maximum Markov blanket size over all nodes. Hence the computational complexity of the three-stage method is O(max{p^2, p m 2^{m−1}}), which is bounded by O(p^2 2^{m−1}). This is much better than the PC algorithm, which has a computational complexity of O(p^{2+m}) under the same sparsity assumption. This analysis is consistent with the CPU times reported in Section 3.5.1.
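To make this gap concrete, consider the illustrative values p = 1000 and m = 10 (assumed here purely for the arithmetic, not a setting used in our experiments): the bound for the three-stage method is p^2 2^{m−1} = 10^6 × 512 ≈ 5×10^8 operations, whereas the corresponding bound for the PC algorithm is p^{2+m} = 1000^{12} = 10^{36}.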
In this paper, we employed logical rules to derive directions for non-convergent edges. Alternatively, we can employ a stochastic optimization method, e.g., simulated annealing (Kirkpatrick et al., 1983) or stochastic approximation annealing (Liang et al., 2014), to direct those edges by optimizing a selected score function. Liang et al. (2014) showed that the stochastic approximation annealing algorithm can converge to the global optimum in probability under a square-root cooling schedule, under which the temperature can decrease much faster than under the logarithmic cooling schedule required by simulated annealing. Further, motivated by hybrid methods, we can employ the stochastic optimization method to resolve the Markov blankets and identify the derived directions. This will be further studied in our future research.
Extension of the proposed method to some other types of mixed data is quite straightforward. For example, for non-Gaussian continuous variables, the nonparanormal transformation proposed by Jia et al. (2017) can be applied to Gaussianize the data prior to applying the three-stage method. For Poisson random variables, the random-effect model-based transformation proposed in Chapter 2 can first be applied to continuize the data, and the nonparanormal transformation can then be applied to Gaussianize the data. Negative binomial data can be treated in the same way. For some other types of discrete data, we might regroup and treat them as multinomial data.
Finally, we note that the moral graph produced in the first stage of the three-stage method is a Markov network. Learning Markov networks for mixed data is also of great interest in the current literature. For example, Cheng et al. (2013) proposed a conditional Gaussian distribution-based method and Fan et al. (2017) proposed a semiparametric latent variable method to tackle the problem. The conditional Gaussian distribution used in Cheng et al. (2013) is similar to (3–3) but includes more interaction terms. They used the nodewise regression method to estimate the Markov network; a sketch of this idea is given below. The semiparametric latent variable method works by introducing a latent Gaussian variable for each of the discrete variables and then estimating the Markov network using a regularization method. However, as stated in Fan et al. (2017), conditional independence between the latent variables does not imply conditional independence between the observed discrete variables.
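To fix ideas, the nodewise regression idea can be sketched in R with penalized regressions, e.g., via the glmnet package. This is a generic sketch under assumed inputs, not the implementation of Cheng et al. (2013); the function name nodewise_edges is ours.

    library(glmnet)

    # 'X' is assumed to be an n x p numeric matrix (binary columns coded 0/1).
    # An edge (i, j) is kept when Xj enters the selected model for node i;
    # for a binary node one would set fam = "binomial" instead.
    nodewise_edges <- function(X, fam = "gaussian") {
      p <- ncol(X); A <- matrix(0, p, p)
      for (i in 1:p) {
        fit  <- cv.glmnet(X[, -i], X[, i], family = fam)
        beta <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]  # drop intercept
        A[i, -i] <- as.numeric(beta != 0)
      }
      (A + t(A)) > 0    # symmetrize the selections by the OR rule
    }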
CHAPTER 4
CONCLUSIONS AND FUTURE RESEARCH
In this thesis we focused on approaches to solving graphical models for general types of data.
First, we presented a Poisson graphical model to construct gene regulatory networks based on next-generation sequencing data. Compared with the existing local Poisson graphical model, which suffers from inconsistency and thereby can only infer certain local structures of the network, the proposed method is consistent in the sense that the true gene regulatory network can be recovered from RNA-seq data when the sample size becomes large. We used a random effect model-based transformation to continuize NGS data; we then transformed the continuized data to Gaussian via a semiparametric transformation and applied an equivalent partial correlation selection method to reconstruct gene regulatory networks. Simulation results demonstrate that the proposed algorithm achieves higher reconstruction accuracy than LPGM, TPGM and SPGM.
The major contribution of the proposed method lies in the data-continuized transformation, which fills the theoretical gap of how to transform NGS data to continuous data and facilitates the learning of gene regulatory networks.

An avenue of further research in this direction would be the development of a general framework for integrating different types of omics data, such as RNA-seq and microarray data, to increase statistical power.
Second, we presented an independence-based approach to learning the structure of Bayesian networks for mixed types of data. The proposed method consists of three stages: it first learns the moral graph of the Bayesian network based on the techniques of variable screening and multiple hypothesis tests, then resolves the local structure of the Markov blanket for each variable based on conditional independence tests, and finally identifies the derived directions for non-convergent connections based on logical rules. The numerical results indicate that the proposed method performs significantly better than the existing methods, such as PC, grow-shrink, hill-climbing, and max-min hill-climbing. We also justified the consistency of the three-stage method under the small-n-large-p scenario. Further research can be conducted by employing a stochastic optimization method to derive directions for non-convergent edges instead of using logical rules.
A subject of future research is the extension of the Bayesian network approach to some other types of mixed data. For example, for non-Gaussian continuous random variables, the nonparanormal transformation proposed by Liu et al. (2009) can be applied to Gaussianize the data prior to applying the three-stage method. Since directions are included in a Bayesian network, it can represent more types of conditional independence than undirected graphical models. We expect that this method will be used in constructing gene regulatory networks from next-generation sequencing data in the future.
The advent of high-throughput techniques has steadily decreased the running cost of gene expression and other genomic feature profiling. With so much data available, how to conduct systematic studies of cancer genomes is of great importance. Graphical models are a very natural tool for learning associations among a large number of genomic features. The methods demonstrated in this thesis provide tools to construct both undirected and directed graphical models.
APPENDIX A
CONSISTENCY OF TRANSFORMATION-BASED METHOD
A.1 Proof of Lemma 1
We first work on the posterior mean of αi. For any ϵ > 0, the mean of the full conditional posterior distribution of αi is

E[αi | θij, βi, yi] = ∫_0^{ϵ/4} αi f(αi | θij, βi, yi) dαi + ∫_{ϵ/4}^{ϵ/2} αi f(αi | θij, βi, yi) dαi + ∫_{ϵ/2}^{∞} αi f(αi | θij, βi, yi) dαi ≤ ϵ/4 + (I1) + (I2).
It is easy to see that (I1) ≤ ϵ/2. To evaluate (I2), we rewrite f(αi | θij, βi, yi) = g(αi) e^{−b1 αi}, where g(αi) is an integrable function. Let m = min_{αi∈[ϵ/4, ϵ/2]} g(αi) and M = max_{αi∈[ϵ/2, ∞)} g(αi), which are known to take finite values. Then

(I2)/(I1) ≤ (M/m) ∫_{ϵ/2}^{∞} e^{−b1 αi} dαi / ∫_{ϵ/4}^{ϵ/2} e^{−b1 αi} dαi = (M/m) · 1/(e^{b1 ϵ/4} − 1) → 0,

as b1 → ∞. Therefore, E[αi | θij, βi, yi] → 0 as b1 → ∞.
Since βi | αi, θij, yi follows Gamma(n αi + a2, ∑_{j=1}^{n} θij + b2), we have E[βi | αi, θij, yi] → 0 as b2 → ∞. By the same argument, we have

E[θij | αi, βi, yi] = (yij + αi)/(βi + 1) → yij,

as b1 → ∞ and b2 → ∞. By the law of iterated expectations, we have E[θij | yi] → yij as b1 → ∞ and b2 → ∞.
A.2 Existing Theory of Adaptive MCMC
Since the prior hyperparameters change with iterations, the resulting posterior distribution also changes with iterations. Hence, the proposed sampling algorithm falls into the class of adaptive MCMC algorithms. For this type of adaptive MCMC algorithm, for which the target distribution changes with iterations, the ergodicity theory has been developed in Fort et al. (2011) and Liang et al. (2016). Here we adopt the theory developed by Liang et al. (2016).
To facilitate our study, we first define some notation for adaptive Markov chains. Consider a state space (X, F), where F = B(X) denotes the Borel σ-field defined on X. Let Xt ∈ X denote the state of the Markov chain at iteration t, and let Pγt denote the transition kernel at iteration t, where γt is a realization of a Y-valued random variable Γt. In simulations, γt is updated according to a specific rule. Let Gt = σ(X0, ..., Xt, Γ0, ..., Γt) be the filtration generated by {(Xi, Γi)}_{i=0}^{t}.

Let P_γ^t(x, B) = P(Xt ∈ B | X0 = x) denote the t-step transition probability for the Markov chain with the fixed transition kernel Pγ and the initial condition X0 = x. Let P^t((x, γ), B) = P(Xt ∈ B | X0 = x, Γ0 = γ), B ∈ F, denote the t-step transition probability for the adaptive Markov chain with initial conditions X0 = x and Γ0 = γ. Let

T(x, γ, t) = ‖P^t((x, γ), ·) − π(·)‖ = sup_{B∈F} |P^t((x, γ), B) − π(B)|

denote the total variation distance between the distribution of the adaptive Markov chain at time t and the target distribution π(·). The adaptive Markov chain is said to be ergodic if lim_{t→∞} T(x, γ, t) = 0 for all x ∈ X and γ ∈ Y.
For the proposed algorithm, since Γt = (b1^{(t)}, b2^{(t)}) takes values in a deterministic sequence, the ergodicity theory developed in Liang et al. (2016) can be re-stated as follows.

Theorem A.1. (Ergodicity; Liang et al. (2016)) Consider an adaptive Markov chain defined on the state space (X, F) with the adaptation index Γt ∈ Y. The adaptive Markov chain is ergodic if the following conditions are satisfied:

(a) (Stationarity) There exists a stationary distribution πγt(·) for each transition kernel Pγt, where γt denotes a realization of the random variable Γt.

(b) (Asymptotic Simultaneous Uniform Ergodicity) For any ϵ > 0, there exist constants K(ϵ) > 0 and N(ϵ) > 0 such that

sup_{x∈X} ‖P_{Γt}^n(x, ·) − π(·)‖ ≤ ϵ,

for all t > K(ϵ) and n > N(ϵ).

(c) (Diminishing Adaptation) Dt = sup_{x∈X} ‖P_{γt+1}(x, ·) − P_{γt}(x, ·)‖ → 0 as t → ∞.
Theorem A.2. (Weak Law of Large Numbers; Liang et al. (2016)) Consider an adaptive Markov chain defined on the state space (X, F). Suppose that conditions (a), (b) and (c) of Theorem A.1 hold. Let λ(·) be a bounded measurable function. Then

(1/n) ∑_{t=1}^{n} λ(Xt) → π(λ), in probability,

as n → ∞, where π(λ) = ∫_X λ(x) π(dx).
A.3 Proof of Lemma 2
Since the law of βi^{(t)} and the law of θij^{(t)} are completely determined by the law of αi^{(t)}, where the superscript t indicates the iteration number, our analysis concentrates on the convergence of αi^{(t)}. For notational simplicity, we rewrite b1^{(t)} as γt and rewrite f(αi | θij, βi, yij) as fγt(x) in what follows. For the proposed algorithm, γt takes values in a deterministic and monotone sequence as specified in Equation (2–5) of the main text.

Since the MH algorithm is used for simulating from fγt(x), condition (a) holds. As shown below, for the proposed algorithm, the posterior distribution π(·) converges to a Dirac delta measure. Hence, following from Theorem A.2, the posterior mean can be obtained by setting λ(x) to a truncated function: λ(x) = x if |x| < M and M otherwise, provided that M is large enough that the interval [−M, M] covers all the yij's. In summary, to prove Lemma 2, it suffices to verify conditions (b) and (c).
Verification of condition (c). Write the target density function as

fγt(x) = g(x) e^{−γt x},

where γt is the adaptive parameter taking the form

γt = γt−1 + c/t^ς, t = 1, 2, ...,
for some constants γ0 > 0, c > 0 and ς ∈ (0, 1]. Let q(x, y) = q(|y − x|) denote a random-walk proposal distribution. Define

sγ(x, y) = q(x, y) min{1, [g(y) e^{−γy} q(y, x)] / [g(x) e^{−γx} q(x, y)]},

and rγ(x, y) = sγ(x, y)/q(x, y). Then, for any Borel set B, the transition kernel is

Pγ(x, B) = ∫_B sγ(x, y) dy + I(x ∈ B) [1 − ∫_X sγ(x, z) dz].
For the derivative dsγ(x, y)/dγ, we have

|dsγ(x, y)/dγ| = |q(x, y) I(rγ(x, y) < 1) rγ(x, y) (y − x)| ≤ q(x, y) |y − x|.

By the mean-value theorem, there exists a constant c2 such that

∫_X |sγ(x, y) − sγ′(x, y)| dy ≤ c2 |γ′ − γ|,

as the proposal is a random-walk proposal. Therefore,

|Pγ(x, B) − Pγ′(x, B)| ≤ 2 c2 |γ − γ′|,

and

Dt = sup_{x∈X} ‖Pγt+1(x, ·) − Pγt(x, ·)‖ ≤ 2 c2 c / (t + 1)^ς → 0,

as t → ∞.
Verification of condition (b). Let P(x, B) denote the degenerate MH transition kernel for the Dirac delta measure π(x) = δ(x = 0), i.e., P(x, B) = 1 if 0 ∈ B, and P(x, B) = 0 otherwise. Then it is easy to see that sup_{x∈X} ‖Pγt(x, ·) − P(x, ·)‖ → 0 as t → ∞.
For any k ≥ 1 and any ψ : X → [−1, 1], we have

P_{γt}^k ψ(x0) − π(ψ) = S1(k) + S2(k),

where π(ψ) = ∫ ψ(x) π(x) dx, and

S1(k) = P^k ψ(x0) − π(ψ), S2(k) = P_{γt}^k ψ(x0) − P^k ψ(x0).

Since P(x, B) is degenerate, we have S1(k) = 0 for all k ≥ 1. For the term S2(k), we can further decompose it as follows: for any k0 (1 ≤ k0 < k),

|S2(k)| ≤ |P_{γt}^{k0} ψ(x0) − P^{k0} ψ(x0)| + |P_{γt}^k ψ(x0) − P_{γt}^{k0} ψ(x0)| + |P^k ψ(x0) − P^{k0} ψ(x0)|
        = |∑_{m=0}^{k0−1} [P^m P_{γt}^{k0−m} ψ(x0) − P^{m+1} P_{γt}^{k0−(m+1)} ψ(x0)]| + |P_{γt}^k ψ(x0) − P_{γt}^{k0} ψ(x0)| + |P^k ψ(x0) − P^{k0} ψ(x0)|
        = |∑_{m=0}^{k0−1} P^m (Pγt − P) P_{γt}^{k0−(m+1)} ψ(x0)| + |P_{γt}^k ψ(x0) − P_{γt}^{k0} ψ(x0)| + |P^k ψ(x0) − P^{k0} ψ(x0)|.

Since sup_x ‖Pγt(x, ·) − P(x, ·)‖ → 0 as t → ∞, for any ϵ > 0 there exists some L(ϵ) such that for any t > L(ϵ),

|S2(k)| ≤ 4 k0 ϵ + |P_{γt}^k ψ(x0) − P_{γt}^{k0} ψ(x0)| + |P^k ψ(x0) − P^{k0} ψ(x0)|
        = 4 k0 ϵ + S3(t, k, k0) + S4(k, k0).
Since P(x, B) is degenerate, we have S4(k, k0) = 0 for any k > k0 > 0. As shown above, γt forms a monotone and deterministic sequence. With such a deterministic sequence, Pγt converges faster and faster as t → ∞. Hence, there exist some K(ϵ) and L′(ϵ) such that for any k > k0 ≥ K(ϵ) and t ≥ L′(ϵ),

S3(t, k, k0) ≤ ϵ.

Let L̄(ϵ) = max{L(ϵ), L′(ϵ)}. Furthermore, one can choose K(ϵ) such that ϵ K(ϵ) → 0 as ϵ → 0.

Setting ϵ = ε/(4K(ε) + 1) and summarizing the results on S1(k) and S2(k), we conclude the following: for any ϵ > 0 and any x0 ∈ X, there exist L̄(ϵ) ∈ N and K(ϵ) ∈ N such that for any t > L̄(ϵ) and k > K(ϵ),

‖P_{γt}^k(x0, ·) − π(·)‖ ≤ ς,

where ς = (4K(ϵ) + 1)ϵ → 0 as ϵ → 0. Condition (b) is verified.
APPENDIX B
CONSISTENCY OF PROPOSED THREE-STAGE METHOD
This appendix establishes the consistency of the proposed three-stage method for learning the structure of Bayesian networks under the small-n-large-p scenario. It consists of two parts. The first part establishes the consistency of the moral graph learning algorithm, and the second part establishes the consistency of the v-structure identification algorithm.
B.1 Consistency of Moral Graph Learning
To indicate that p can grow as a function of n, we rewrite p as pn, rewrite the distribution P in (3–1) as P(n), and rewrite the true Bayesian network G as G(n) = (V(n), E(n)). Let G̃(n) = (Ṽ(n), Ẽ(n)) denote the marginal association network, where Ṽ(n) = V(n) and the association is measured by the coefficients of the marginal regression

Xi ∼ 1 + Xj, i, j = 1, 2, ..., pn, (B–1)

which can be a normal linear regression or a multiclass logistic regression depending on the type of Xi. Let γij denote the coefficient of Xj in (B–1), which is called the marginal regression coefficient (MRC) in this paper. Then we have

Ẽ(n) = {(i, j) : γij ≠ 0, i, j = 1, ..., pn}.

Let γ̂ij denote the estimate of γij, let νn denote a threshold value of the MRC, let Ê_νn denote the edge set of the network obtained through MRC thresholding at νn, and let Ê_νn,i denote the neighborhood of node i in Ê_νn. That is, we define

Ê_νn = {(i, j) : |γ̂ij| > νn}, and Ê_νn,i = {j : j ≠ i, |γ̂ij| > νn}. (B–2)
For convenience, we call the network with the edge set Ê_νn the thresholded MRC network. Similarly, we let βij denote the regression coefficient of Xj in the node-wise GLM

Xi ∼ 1 + Xj + ∑_{k∈V(n)\{i,j}} Xk. (B–3)
Following from the total conditioning property of Bayesian networks (Pellet & Elisseeff, 2008), which shows that Xj ∈ Si if and only if Xi and Xj are conditionally dependent given V \ {Xi, Xj}, we have βij ≠ 0 if and only if Xj ∈ Si. Let E(n)_mb = {(i, j) : βij ≠ 0, i, j = 1, ..., pn} denote the edge set of the moral graph. We partition E(n)_mb into two subsets, E(n)_p = {(i, j) : βij ≠ 0, γij ≠ 0} and E(n)_s = {(i, j) : βij ≠ 0, γij = 0}. The former set contains the parent-child links as well as the spouse links for which the two spouse variables are marginally dependent. The latter set contains the spouse links for which the two spouse variables are marginally independent, but dependent conditioned on their common child.
Let Zi = (1, Xi,1, ..., Xi,qn)′, where {Xi,1, ..., Xi,qn} ⊂ {X1, X2, ..., Xpn} \ {Xi}, and qn is bounded by O(n/log(n)). In this paper, qn is allowed to increase with n at an appropriate rate. The regression model Xi ∼ Zi is assumed, with quasi-likelihood function −l(Zi^T ξi, Xi), where ξi denotes the vector of regression coefficients. Let

ξi* = argmin_{ξi} E l(Zi^T ξi, Xi) (B–4)

be the population parameter, and let

ξ̂i = argmin_{ξi} Pn l(Zi^T ξi, Xi) (B–5)

be the maximum likelihood estimator (MLE), where Pn f(X, Y) = n^{−1} ∑_{i=1}^{n} f(Xi, Yi) is the empirical measure and

l(X; θ) = −[θX − b(θ) − log c(X)] (B–6)

denotes the log-density function (in the canonical form) of the exponential family, where b(·) and c(·) denote some known functions. Assume that ξi* is an interior point of a sufficiently large, compact and convex set F ⊂ R^{qn+1}. For any pair (Zi, Xi), the following conditions are assumed:
(E1) The Fisher information

I(ξi) = E{[∂l(Zi^T ξi, Xi)/∂ξi] [∂l(Zi^T ξi, Xi)/∂ξi]^T}

is finite and positive at ξi = ξi*. Moreover, ‖I(ξi)‖_F = sup_{ξi∈F, ‖z‖=1} ‖I(ξi)^{1/2} z‖ exists, where ‖·‖ is the Euclidean norm.
(E2) The function l(zi^T ξi, xi) satisfies the Lipschitz property with positive constant kn:

|l(zi^T ξi, xi) − l(zi^T ξi′, xi)| In(zi, xi) ≤ kn |zi^T ξi − zi^T ξi′| In(zi, xi),

for ξi, ξi′ ∈ F, where In(zi, xi) = I((zi, xi) ∈ Ωn) with Ωn = {(z, x) : ‖(z, x)‖∞ ≤ Kn} for some sufficiently large positive constant Kn, and ‖·‖∞ being the supremum norm. In addition, there exists a sufficiently large constant C such that, with bn = C kn Vn^{−1} (qn/n)^{1/2} and Vn given in condition (E3),

sup_{ξi∈F, ‖ξi−ξi*‖≤bn} |E[l(Zi^T ξi, Xi) − l(Zi^T ξi*, Xi)](1 − In(Zi, Xi))| ≤ o(qn/n).

(E3) The function l(Zi^T ξi, Xi) is convex in ξi, satisfying

E(l(Zi^T ξi, Xi) − l(Zi^T ξi*, Xi)) ≥ Vn ‖ξi − ξi*‖^2

for all ‖ξi − ξi*‖ ≤ bn and some positive constant Vn.
(E4) There exist some positive constants m0, m1, s0, s1 and α such that, for sufficiently large t,

P(|Xj| > t) ≤ (m1 − s1) exp{−m0 t^α}, j = 1, ..., pn,

and that

E exp{b(Z̄i^T ξ̄i + s0) − b(Z̄i^T ξ̄i)} + E exp{b(Z̄i^T ξ̄i − s0) − b(Z̄i^T ξ̄i)} ≤ s1,

where ξ̄i = {βij : βij ≠ 0, j ∈ P(i)}, P(i) = {j : (i, j) ∈ E(n)_p}, and Z̄i contains the corresponding predictors, that is, Z̄i^T ξ̄i = βi0 + ∑_{j∈P(i)} Xj βij.
(E5) The variance Var(Z̄i^T ξ̄i) is bounded from above and below, where Z̄i and ξ̄i are as specified in condition (E4).

(E6) Either b′′(·) is bounded or XM = (X1, ..., Xpn)^T follows an elliptically contoured distribution, that is,

XM = Σ^{1/2} R U,

and |E b′(Z̄i^T ξ̄i)(Z̄i^T ξ̄i − βi0)| is bounded, where U is uniformly distributed on the unit sphere in pn-dimensional Euclidean space, independently of the nonnegative random variable R, Σ = Var(XM), and λmax(Σ) = O(n^τ) for some constant 0 ≤ τ < 1 − 2κ, where κ is as defined in Condition (B) in Section 3.2.4.
Assumption (E6) implies that the largest eigenvalue of Σ is allowed to grow with n, but the growth rate is restricted; otherwise, the resulting thresholded correlation network can be dense. It follows from the definition of P(i) and Condition (D) of Section 3.2.4 that there exists a constant c2 such that

min_i min_{j∈P(i)} |γij| ≥ c2 n^{−κ}. (B–7)
Lemma 3 concerns the sure screening property of the thresholded association network; it follows from Theorem 4 of Fan et al. (2010).

Lemma 3. Suppose that conditions (A), (B), (E1)-(E4) hold.

(i) If Kn = o(n^{(1−2κ)/(α+2)}), then for any c3 > 0 there exists a positive constant c4 such that

P(max_{1≤i,j≤pn} |γ̂ij − γij| ≥ c3 n^{−κ}) ≤ O(pn^2 exp(−c4 n^{(1−2κ)α/(α+2)})) = o(1). (B–8)

(ii) If, in addition, condition (D) holds, then by taking νn = c5 n^{−κ} with 0 < c5 ≤ c2/2, we have

P(P(i) ⊆ Ê_νn,i) ≥ 1 − O(pn exp(−c4 n^{(1−2κ)α/(α+2)})) = 1 − o(1), (B–9)

P(E(n)_p ⊆ Ê_νn) ≥ 1 − O(pn^2 exp(−c4 n^{(1−2κ)α/(α+2)})) = 1 − o(1). (B–10)
Lemma 4. Suppose that conditions (A), (B), (E1)-(E6) hold. If Kn = o(n^{(1−2κ)/(α+2)}), then, for any νn = c5 n^{−κ}, we have

P(|Ê_νn,i| ≤ O(n^{2κ+τ})) ≥ 1 − O(pn exp(−c4 n^{(1−2κ)α/(α+2)})) = 1 − o(1). (B–11)

Since the exact value of 2κ + τ is unknown, we may bound the size of the neighborhood Ê_νn,i by O(n/log(n)) in practice. However, when n is large, n/log(n) can be too large. An excessively large size of the set will adversely affect the power of the moral graph screening tests. To address this issue, we propose a multiple hypothesis test-based procedure, i.e., step
(a)-(ii), for pre-identification of the nonzero marginal association measures. To justify this procedure, we have the following lemmas.

Lemma 5. Assume conditions (A), (B), (D), (E1)-(E4) hold. If ηn = (1/2) c2 n^{−κ}, where c2 is defined in (B–7), then

P[E(n)_p ⊂ Ê_ηn] = 1 − o(1), as n → ∞.
Proof. Let Aij denote the event that an error occurs when testing the hypotheses H0 : γij = 0 versus H1 : γij ≠ 0 for variables Xi and Xj. Let AIij and AIIij denote the false positive and false negative errors, respectively. Then Aij = AIij ∪ AIIij, where

False positive error AIij : |γ̂ij| > (c2/2) n^{−κ} and γij = 0,
False negative error AIIij : |γ̂ij| ≤ (c2/2) n^{−κ} and γij ≠ 0. (B–12)

By (B–7), minij |γij| ≥ c2 n^{−κ} for the links in E(n)_p. Therefore, by Lemma 3-(i),

P[missing a link of E(n)_p in Ê_ηn] ≤ P(max_{1≤i,j≤pn} |γ̂ij − γij| ≥ (c2/2) n^{−κ}) ≤ o(1), (B–13)

which concludes the proof.
Therefore, based on Lemmas 3, 4 and 5, we propose to restrict the size of the set Ai (in Algorithm 3.1) for each node to be

nsize = min{|Ê_ηn,i|, n/(cn1 log(n))}, (B–14)

where cn1 is a small constant, e.g., cn1 = 1, 2, or 3. The value of ηn can be determined through a simultaneous test of the hypotheses H0 : γij = 0 versus H1 : γij ≠ 0, 1 ≤ i < j ≤ pn, at a significance level of α1.
Lemma 6 concerns the convergence of the MLE of the regression coefficients when all the true predictors have been included. The lemma is a restatement of a theorem of Fan et al. (2010).

Lemma 6. Assume conditions (A), (B), (E1)-(E3) hold. If Kn = o(n^{(1−2κ)/(α+2)}), then, for any constant c7 > 0, there exists a constant c8 > 0 such that

P(max_{1≤i≤pn} ‖ξ̂i − ξi*‖ ≥ c7 n^{−κ}) ≤ O(pn exp(−c8 n^{(1−2κ)α/(α+2)})) = o(1), (B–15)

where ξi* is defined in (B–4) and ξ̂i is the MLE of ξi*.
Recall that if the Markov blanket Si (of node Xi) is contained in Zi, then ξ*_{i,k} = βij for j ∈ Si with Xj = Xi,k, and ξ*_{i,k} = 0 otherwise.

Let β̂ij denote the estimate of βij obtained in step (c) of Algorithm 3.1. Let ςn denote the threshold value of β̂ij, and let Ê_mb,ςn denote the network obtained through thresholding β̂ij. That is, we define

Ê_mb,ςn = {(i, j) : |β̂ij| > ςn}.

To establish the consistency of Ê_mb,ςn, we first note that, as implied by Condition (D) and the total conditioning property of Bayesian networks, there exists a constant c6 such that the true regression coefficients βij defined in (B–3) satisfy

min_i min_{j∈Si} |βij| ≥ c6 n^{−κ}, (B–16)

where κ is as defined in Condition (B). Let Ê* denote the edge set of a marginal association network for which each node has a degree of O(n/log(n)), being adjacent to its O(n/log(n)) most highly associated nodes. It follows from Lemmas 3 and 4 that

P[E(n)_p ⊆ Ê*] ≥ 1 − O(pn^2 exp(−c4 n^{(1−2κ)α/(α+2)})) = 1 − o(1). (B–17)

Let B = ∪_{i=1}^{pn} Bi, where Bi is defined in step (b) of the p-screening algorithm. We have E(n)_s ⊂ B. Further, by Lemma 5, we have E(n)_mb = E(n)_p ∪ E(n)_s ⊆ (Ê* ∩ Ê_ηn) ∪ B.

Lemma 7 establishes the consistency of Ê_mb,ςn as an estimate of E(n)_mb conditioned on E(n)_mb ⊆ (Ê* ∩ Ê_ηn) ∪ B. Its proof follows closely the proof of Lemma 5, based on (B–16), and is thus omitted here.
Lemma 7. Assume that conditions (A), (B), (C), (D) and (E1)-(E6) hold and that E(n)_mb ⊆ (Ê* ∩ Ê_ηn) ∪ B holds. Let ςn = (1/2) c6 n^{−κ}. If Kn = o(n^{(1−2κ)/(α+2)}), then

P[Ê_mb,ςn = E(n)_mb | E(n)_mb ⊆ (Ê* ∩ Ê_ηn) ∪ B] = 1 − o(1), as n → ∞.
As a summary of the above results, we have the following theorem, which establishes the consistency of Ê_mb,ςn as an estimate of the edge set of the moral graph E(n)_mb.

Theorem B.1. Consider a Bayesian network distribution P(n) defined in (3–1) for mixed GLM variables. Assume conditions (A), (B), (C), (D) and (E1)-(E6) hold. If Kn = o(n^{(1−2κ)/(α+2)}), then

P[Ê_mb,ςn = E(n)_mb] ≥ 1 − o(1), as n → ∞.

Proof. By invoking Lemma 5, (B–17), and Lemma 7, we have

P[Ê_mb,ςn = E(n)_mb] ≥ P[Ê_mb,ςn = E(n)_mb | E(n)_mb ⊆ (Ê* ∩ Ê_ηn) ∪ B] P[E(n)_mb ⊆ (Ê* ∩ Ê_ηn) ∪ B]
                    ≥ [1 − o(1)][1 − o(1) + 1 − o(1) − 1]
                    = 1 − o(1),

where the second factor is bounded using P(A ∩ B) ≥ P(A) + P(B) − 1 applied to the events established by Lemma 5 and (B–17). This concludes the proof.
B.2 Consistency of v-structure Identification

Since the consistency of the collider set algorithm has been proved by Pellet & Elisseeff (2008) for the low-dimensional problem, the algorithm is correct, and we only need to prove that the total errors, including both the type-I and type-II errors, of the conditional independence tests involved in the algorithm can be kept at a zero level in probability as n and p go to infinity.

Let Dij = {D : D ⊆ Si \ {j}} denote the set of all possible subsets of Si \ {j}. The cardinality of Dij is 2^{|Si|−1}, which is upper bounded by 2^{m−1}, where m denotes the upper bound of the Markov blanket size. Let F = {D ∈ Dij : Xi and Xj are conditionally dependent given XD}. Then, it follows from
Condition (D) that there exists a constant c9 > 0 such that

min_i min_{j∈Si} min_{D∈F} β_{ij|D} ≥ c9 n^{−κ}, (B–18)

where β_{ij|D} denotes the regression coefficient defined in (3–6). Let λn denote the critical value of the test of the hypotheses H0 : β_{ij|D} = 0 versus H1 : β_{ij|D} ≠ 0. Then a v-structure can be identified if we find a set D ∈ Dij such that β̂_{ij|D} < λn. Let Ê(n)_v,λn denote the set of v-structures identified with the critical value λn.
Theorem B.2. Consider a Bayesian network with distribution P(n) defined in (3–1) for mixed GLM variables. Assume conditions (A), (B), (C), (D) and (E1)-(E3) hold. Let λn = (c9/2) n^{−κ}. If Kn = o(n^{(1−2κ)/(α+2)}), then

P[Ê(n)_v,λn = E(n)_v | Ê_mb,ςn = E(n)_mb] ≥ 1 − o(1), as n → ∞.
Proof. Let A_{ij|D} denote the event that an error occurs when testing the hypothesis H0 : β_{ij|D} = 0 versus H1 : β_{ij|D} ≠ 0 for variables Xi and Xj. Let AI_{ij|D} and AII_{ij|D} denote the false positive and false negative errors, respectively. Then A_{ij|D} = AI_{ij|D} ∪ AII_{ij|D}, where

False positive error AI_{ij|D} : β̂_{ij|D} > (c9/2) n^{−κ} and β_{ij|D} = 0,
False negative error AII_{ij|D} : β̂_{ij|D} ≤ (c9/2) n^{−κ} and β_{ij|D} ≠ 0. (B–19)

By (B–18), we have min_i min_{j∈Si} min_{D∈F} β_{ij|D} ≥ c9 n^{−κ}. Therefore, by Lemma 6,

P[Ê(n)_v,λn ≠ E(n)_v | Ê_mb,ςn = E(n)_mb] ≤ P(max_{1≤i≤pn, j∈Si, D∈Dij} |β̂_{ij|D} − β_{ij|D}| ≥ (c9/2) n^{−κ})
                                         ≤ O(pn m 2^{m−1} exp(−c8 n^{(1−2κ)α/(α+2)})), (B–20)

where m = n^b with b as defined in condition (C) of Section 3.2.4. This concludes the proof.
REFERENCES
Aguiar, M., Masse, R. & Gibbs, B. F. (2005). Regulation of cytochrome p450 by posttranslational modification. Drug Metabolism Reviews 37, 379–404.
Aliferis, C. F., Statnikov, A., Tsamardinos, I., Mani, S. & Koutsoukos, X. D. (2010). Local causal and markov blanket induction for causal discovery and feature selection for classification part i: Algorithms and empirical evaluation. Journal of Machine Learning Research 11, 171–234.
Allen, G. I. & Liu, Z. (2013). A local poisson graphical model for inferring networks from sequencing data. IEEE Transactions on NanoBioscience 12, 189–198.
Anders, S. & Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology 11, R106.
Barabasi, A.-L. & Albert, R. (1999). Emergence of scaling in random networks. Science 286, 509–512.
Benjamini, Y., Krieger, A. M. & Yekutieli, D. (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 491–507.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 192–236.
Bouckaert, R. R. (2001). Bayesian belief networks: from construction to inference.
Cantile, M., Scognamiglio, G., Anniciello, A., Farina, M., Gentilcore, G., Santonastaso, C., Fulciniti, F., Cillo, C., Franco, R., Ascierto, P. A. et al. (2012). Increased hox c13 expression in metastatic melanoma progression. Journal of Translational Medicine 10, 91.
Carvalho, R. H., Haberle, V., Hou, J., van Gent, T., Thongjuea, S., van IJcken, W., Kockx, C., Brouwer, R., Rijkers, E., Sieuwerts, A. et al. (2012). Genome-wide dna methylation profiling of non-small cell lung carcinomas. Epigenetics & Chromatin 5, 9.
Cheng, J., Li, T., Levina, E. & Zhu, J. (2013). High-dimensional mixed graphical models. arXiv:1304.2810.
Chickering, D. M. (1996). Learning bayesian networks is np-complete. In Learning from Data. Springer, pp. 121–130.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research 3, 507–554.
Colombo, D., Maathuis, M. H., Kalisch, M. & Richardson, T. S. (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, 294–321.
Danaher, P., Wang, P. & Witten, D. M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76, 373–397.
De Montellano, P. R. O. (2005). Cytochrome P450: Structure, Mechanism, and Biochemistry. Springer Science & Business Media.
DeKelver, R. C., Lewin, B., Lam, K., Komeno, Y., Yan, M., Rundle, C., Lo, M.-C. & Zhang, D.-E. (2013). Cooperation between runx1-eto9a and novel transcriptional partner klf6 in upregulation of alox5 in acute myeloid leukemia. PLoS Genetics 9, e1003765.
Dempster, A. P. (1972). Covariance selection. Biometrics, 157–175.
Dobra, A., Lenkoski, A. et al. (2011). Copula gaussian graphical models and their application to modeling functional disability data. The Annals of Applied Statistics 5, 969–993.
Fan, J., Liu, H., Ning, Y. & Zou, H. (2017). High dimensional semiparametric latent graphical model for mixed data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 405–421.
Fan, J. & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 849–911.
Fan, J., Song, R. et al. (2010). Sure independence screening in generalized linear models with np-dimensionality. The Annals of Statistics 38, 3567–3604.
Faner, R., Cruz, T., Casserras, T., Lopez-Giraldo, A., Noell, G., Coca, I., Tal-Singer, R., Miller, B., Rodriguez-Roisin, R., Spira, A. et al. (2016). Network analysis of lung transcriptomics reveals a distinct b-cell signature in emphysema. American Journal of Respiratory and Critical Care Medicine 193, 1242–1253.
Feng, Q., Sekula, D., Guo, Y., Liu, X., Black, C. C., Galimberti, F., Shah, S. J., Sempere, L. F., Memoli, V., Andersen, J. B. et al. (2008). Ube1l causes lung cancer growth suppression by targeting cyclin d1. Molecular Cancer Therapeutics 7, 3780–3788.
Fort, G., Moulines, E. & Priouret, P. (2011). Convergence of adaptive and interacting markov chain monte carlo algorithms. The Annals of Statistics, 3262–3289.
Friedman, J., Hastie, T. & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.
Gallopin, M., Rau, A. & Jaffrezic, F. (2013). A hierarchical poisson log-normal model for network inference from rna sequencing data. PLoS ONE 8, e77503.
Gharabaghi, M. A. (2016). Diagnostic investigation of birc6 and sirt1 protein expression level as potential prognostic biomarkers in patients with non-small cell lung cancer. The Clinical Respiratory Journal.
Ha, M. J., Sun, W. & Xie, J. (2015). Penpc: A two-step approach to estimate the skeletons of high-dimensional directed acyclic graphs. Biometrics.
Hammersley, J. M. & Clifford, P. (1971). Markov fields on finite graphs and lattices.
Harris, N. & Drton, M. (2013). Pc algorithm for nonparanormal graphical models. Journal of Machine Learning Research 14, 3365–3383.
Heckerman, D., Geiger, D. & Chickering, D. M. (1995). Learning bayesian networks: The combination of knowledge and statistical data. Machine Learning 20, 197–243.
Herskovits, E. H. & Cooper, G. F. (2013). Kutato: An entropy-driven system for construction of probabilistic expert systems from databases. arXiv preprint arXiv:1304.1088.
Hoff, P. D. (2007). Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics, 265–283.
Humbert, M., Halter, V., Shan, D., Laedrach, J., Leibundgut, E. O., Baerlocher, G. M., Tobler, A., Fey, M. F. & Tschan, M. P. (2011). Deregulated expression of kruppel-like factors in acute myeloid leukemia. Leukemia Research 35, 909–913.
Inouye, D. I., Ravikumar, P. & Dhillon, I. S. (2016). Square root graphical models: Multivariate generalizations of univariate exponential families that permit positive dependencies. In JMLR Workshop and Conference Proceedings, vol. 48. NIH Public Access.
Jia, B., Xu, S., Xiao, G., Lamba, V. & Liang, F. (2017). Learning gene regulatory networks from next generation sequencing data. Biometrics.
Kalisch, M. & Buhlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the pc-algorithm. Journal of Machine Learning Research 8, 613–636.
Kirkpatrick, S., Gelatt, C. D., Vecchi, M. P. et al. (1983). Optimization by simulated annealing. Science 220, 671–680.
Kjaerulff, U. B. & Madsen, A. L. (2008). Bayesian networks and influence diagrams. Springer Science+Business Media 200, 114.
Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer.
Lam, W. & Bacchus, F. (1994). Learning bayesian belief networks: An approach based on the mdl principle. Computational Intelligence 10, 269–293.
Lauritzen, S. L. (1996). Graphical Models, vol. 17. Clarendon Press.
Lee, J. & Hastie, T. (2013). Structure learning of mixed graphical models. In Artificial Intelligence and Statistics.
Lee, J. D. & Hastie, T. J. (2015). Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics 24, 230–253.
Liang, F., Cheng, Y. & Lin, G. (2014). Simulated stochastic approximation annealing for global optimization with a square-root cooling schedule. Journal of the American Statistical Association 109, 847–863.
Liang, F., Jin, I. H., Song, Q. & Liu, J. S. (2016). An adaptive exchange algorithm for sampling from distributions with intractable normalizing constants. Journal of the American Statistical Association 111, 377–393.
Liang, F., Song, Q. & Qiu, P. (2015). An equivalent measure of partial correlation coefficients for high-dimensional gaussian graphical models. Journal of the American Statistical Association 110, 1248–1265.
Liang, F. & Zhang, J. (2008). Estimating the false discovery rate using the stochastic approximation algorithm. Biometrika.
Liu, H., Lafferty, J. & Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research 10, 2295–2328.
Liu, P., Morrison, C., Wang, L., Xiong, D., Vedell, P., Cui, P., Hua, X., Ding, F., Lu, Y., James, M. et al. (2012). Identification of somatic mutations in non-small cell lung carcinomas using whole-exome sequencing. Carcinogenesis 33, 1270–1276.
Magrassi, L., Conti, L., Lanterna, A., Zuccato, C., Marchionni, M., Cassini, P., Arienta, C. & Cattaneo, E. (2005). Shc3 affects human high-grade astrocytomas survival. Oncogene 24, 5198–5206.
Margaritis, D. (2003). Learning Bayesian Network Model Structure from Data. Ph.D. thesis, US Army.
Margaritis, D. & Thrun, S. (1999). Bayesian network induction via local neighborhoods. Tech. rep., DTIC Document.
Mazumder, R. & Hastie, T. (2012). The graphical lasso: New insights and alternatives. Electronic Journal of Statistics 6, 2125.
McGeachie, M. J., Chang, H.-H. & Weiss, S. T. (2014). Cgbayesnets: conditional gaussian bayesian network learning and inference with mixed discrete and continuous data. PLoS Computational Biology 10, e1003676.
Meek, C. (1995). Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.
Meinshausen, N. & Buhlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 1436–1462.
Mizuno, H., Kitada, K., Nakai, K. & Sarai, A. (2009). Prognoscan: a new database for meta-analysis of the prognostic value of genes. BMC Medical Genomics 2, 18.
Mogi, A. & Kuwano, H. (2011). Tp53 mutations in nonsmall cell lung cancer. BioMed Research International 2011.
Muller, P. (1992). Alternatives to the gibbs sampling scheme.
Nelson, D. R., Koymans, L., Kamataki, T., Stegeman, J. J., Feyereisen, R., Waxman, D. J., Waterman, M. R., Gotoh, O., Coon, M. J., Estabrook, R. W. et al. (1996). P450 superfamily: update on new sequences, gene mapping, accession numbers and nomenclature. Pharmacogenetics and Genomics 6, 1–42.
Nielsen, T. D. & Jensen, F. V. (2009). Bayesian Networks and Decision Graphs. Springer Science & Business Media.
Patil, G. P., Joshi, S. W. & Rao, C. R. (1968). A Dictionary and Bibliography of Discrete Distributions. International Statistical Institute.
Pearl, J. (2014). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Pearl, J. & Verma, T. S. (1995). A theory of inferred causation. Studies in Logic and the Foundations of Mathematics 134, 789–811.
Pellet, J.-P. & Elisseeff, A. (2008). Using markov blankets for causal structure learning. Journal of Machine Learning Research 9, 1295–1342.
Plant, N. (2007). The human cytochrome p450 sub-family: transcriptional regulation, inter-individual variation and interaction networks. Biochimica et Biophysica Acta (BBA) - General Subjects 1770, 478–488.
Preetam, N., Alain, H. & Maathuis, M. H. (2016). High-dimensional consistency in score-based and hybrid structure learning. arXiv:1507.02608.
Ravikumar, P., Wainwright, M. J., Lafferty, J. D. et al. (2010). High-dimensional ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics 38, 1287–1319.
Robinson, M. D. & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of rna-seq data. Genome Biology 11, R25.
Scutari, M. & Denis, J.-B. (2014). Bayesian Networks: with Examples in R. CRC Press.
Spirtes, P. (2010). Introduction to causal inference. Journal of Machine Learning Research 11, 1643–1662.
Spirtes, P., Glymour, C. N. & Scheines, R. (2000). Causation, Prediction, and Search. MIT Press.
Sultan, M., Schulz, M. H., Richard, H., Magen, A., Klingenhoff, A., Scherf, M., Seifert, M., Borodina, T., Soldatov, A., Parkhomchuk, D. et al. (2008). A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321, 956–960.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.
Tsamardinos, I., Aliferis, C. F. & Statnikov, A. (2003a). Time and sample efficient discovery of markov blankets and direct causal relations. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
Tsamardinos, I., Aliferis, C. F., Statnikov, A. R. & Statnikov, E. (2003b). Algorithms for large scale markov blanket discovery. In FLAIRS Conference, vol. 2.
Tsamardinos, I., Brown, L. E. & Aliferis, C. F. (2006). The max-min hill-climbing bayesian network structure learning algorithm. Machine Learning 65, 31–78.
Verma, T. S. & Pearl, J. (1991). Equivalence and synthesis of causal models. In Uncertainty in Artificial Intelligence, vol. 6.
Wan, Y.-W., Allen, G. I., Baker, Y., Yang, E., Ravikumar, P. & Liu, Z. (2015). Package "huge": High-dimensional undirected graph estimation.
Wang, W., Baladandayuthapani, V., Morris, J. S., Broom, B. M., Manyam, G. & Do, K.-A. (2013). ibag: integrative bayesian analysis of high-dimensional multiplatform genomics data. Bioinformatics 29, 149–159.
Yahav, I. & Shmueli, G. (2012). On generating multivariate poisson data in management science applications. Applied Stochastic Models in Business and Industry 28, 91–102.
Yang, E., Allen, G., Liu, Z. & Ravikumar, P. K. (2012). Graphical models via generalized linear models. In Advances in Neural Information Processing Systems.
Yang, E., Ravikumar, P. K., Allen, G. I. & Liu, Z. (2013). On poisson graphical models. In Advances in Neural Information Processing Systems.
Yang, X., Zhang, B., Molony, C., Chudin, E., Hao, K., Zhu, J., Gaedigk, A., Suver, C., Zhong, H., Leeder, J. S. et al. (2010). Systematic genetic and genomic analysis of cytochrome p450 enzyme activities in human liver. Genome Research 20, 1020–1036.
Yaramakala, S. & Margaritis, D. (2005). Speculative markov blanket discovery for optimal feature selection. In Data Mining, Fifth IEEE International Conference on. IEEE.
Yuan, M. & Lin, Y. (2007). Model selection and estimation in the gaussian graphical model. Biometrika 94, 19–35.
Zhao, J., Li, P., Feng, H., Wang, P., Zong, Y., Ma, J., Zhang, Z., Chen, X., Zheng, M., Zhu, Z. et al. (2013). Cadherin-12 contributes to tumorigenicity in colorectal cancer by promoting migration, invasion, adhesion and angiogenesis. Journal of Translational Medicine 11, 288.
Zhao, T., Li, X., Liu, H., Roeder, K. & Larry, J. L. (2015). Package "huge": High-dimensional undirected graph estimation.
Zhou, H., Brekman, A., Zuo, W.-L., Ou, X., Shaykhiev, R., Agosto-Perez, F. J., Wang, R., Walters, M. S., Salit, J., Strulovici-Barel, Y. et al. (2016). Pou2af1 functions in the human airway epithelium to regulate expression of host defense genes. The Journal of Immunology 196, 3159–3167.
BIOGRAPHICAL SKETCH
Suwa Xu was born in Yixing, Jiangsu, China. She attended Yixing High School in 2004 and was accepted into the statistics program at South China Agricultural University in 2007.

Suwa Xu graduated in 2011 with a bachelor's degree in statistics and was later accepted into the Department of Statistics at Rice University as a master's student.

After obtaining her master's degree, Suwa Xu came to Gainesville for further study as a Ph.D. student in biostatistics at the University of Florida. She received her Ph.D. from the University of Florida in the summer of 2017.