On Heritability of Structural Connectomes
by
Jaewon Chung
A thesis submitted to The Johns Hopkins University
in conformity with the requirements for the degree of
Master of Science in Engineering
Baltimore, Maryland
May, 2019
© 2019 by Jaewon Chung
All rights reserved
Abstract
Recent advancements in technology have allowed the collection of enormous
amounts of neuroimaging data to study the functionality of human brains.
To study the human brain, connectomes, or brain graphs, can be derived
from neuroimaging data such as diffusion (dMRI) and functional magnetic
resonance imaging (fMRI). GraSPy, an open-source Python package, was
developed to leverage recent advances in statistics and random graph theory
to study populations of connectomes. These algorithms were then applied to
show that structural connectomes are heritable in humans.
Primary Reader: Joshua T. Vogelstein
Table of Contents

List of Tables
List of Figures
1 GraSPy: Graph Statistics in Python
  1.1 Introduction
  1.2 Library Overview
    1.2.1 Simulations (Figure 1.1a)
    1.2.2 Preprocessing (Figure 1.1b)
    1.2.3 Embedding (Figure 1.1c)
    1.2.4 Hypothesis Testing (Figure 1.1d)
    1.2.5 Clustering (Figure 1.1e)
    1.2.6 Plotting (Figure 1.1f)
  1.3 Conclusion
2 Structural Connectomes are Heritable
  2.1 Introduction
  2.2 Results
  2.3 Data Acquisition
    2.3.1 Participants
    2.3.2 Diffusion MRI Acquisition & Processing
  2.4 Preliminaries
    2.4.1 Graph
    2.4.2 Random Dot Product Graphs
    2.4.3 Adjacency Spectral Embedding
    2.4.4 Choosing the Embedding Dimension
    2.4.5 Pass-To-Ranks
  2.5 Estimation of Heritability
    2.5.1 Preprocessing Data
    2.5.2 Models of Heritability and Distance Measures
    2.5.3 Kolmogorov-Smirnov Two-Sample Test
List of Tables
2.1 P-values given by the Kolmogorov-Smirnov test on the distributions
of distances shown in Figure 2.1. The null hypothesis is that the two
distributions of test statistics are the same, and the alternative hypothesis
(H1) is that one distribution is stochastically larger than the other. The
null hypothesis is rejected at significance level α = 0.05 for all alternative
hypotheses under all three models of heritability.
2.2 Participant demographics of the HCP1200 dataset.
List of Figures
1.1 Illustration of submodules and the procedure for statistical inference
on populations of graphs. A detailed description of each submodule is
given in Section 1.2.
1.2 Connectome model fitting and complexity. Larval Drosophila left
mushroom body adjacency matrix (unweighted, directed), followed by
random samples from four different statistical models of connectomes fit
using GraSPy: random dot product graph (RDPG), degree-corrected
stochastic block model (DCSBM), stochastic block model (SBM), and
Erdős-Rényi (ER). The bottom left shows the number of parameters for
each, as compared to the 40,000+ parameters (possible edges) of the
inhomogeneous Erdős-Rényi (IER) model, in which all potential edges
are specified. Blocks are sorted by size (number of member vertices) and
nodes are sorted by degree within each block. The block labels
correspond to K) Kenyon cells, P) projection neurons, O) mushroom
body output neurons, I) mushroom body input neurons (Eichler et al.,
2017).
2.1 Kernel density estimates (KDE) of pairwise distances of connectomes
from three different measures: exact, global scale, and vertex-wise scale,
which represent different models of heritability. Each column
corresponds to a distance measure, and each row corresponds to a
familial relationship. The white vertical line within each KDE represents
the mean. Under all three models, the distances for monozygotic twins
were stochastically smaller than those of dizygotic twins, siblings, and
unrelated pairs, while distances for unrelated pairs were stochastically
larger than all others. Distances for dizygotic twins were stochastically
smaller than those of siblings and unrelated pairs, but stochastically
larger than those of monozygotic twins.
Chapter 1
GraSPy: Graph Statistics in Python
1.1 Introduction
Graphs, or networks, are a mathematical representation of data that consists
of discrete objects (nodes or vertices) and relationships between these objects
(edges). For example, if the regions of a human brain are taken as vertices, the
edges can represent how strongly each pair of regions is connected to each
other. Since graphs necessarily deal with relationships between nodes, many
of the classical statistical assumptions about independence are violated. Thus,
specific statistical methodology is required for performing robust statistical in-
ference on graphs and populations of graphs (Athreya et al., 2018). GraSPy fills
this gap by providing implementations of algorithms with strong statistical
guarantees, such as graph and multi-graph embedding methods, two-graph
hypothesis testing, and clustering of vertices of graphs. Many of the algo-
rithms implemented in GraSPy are flexible and can operate on graphs that are
weighted or unweighted, as well as directed or undirected. All subsequent
analysis in this thesis was performed using GraSPy.
Figure 1.1: Illustration of submodules and the procedure for statistical inference on populations of graphs: a) Simulations, b) Preprocessing, c) Embedding, d) Hypothesis Testing, e) Clustering, f) Visualization. A detailed description of each submodule is given in Section 1.2.
1.2 Library Overview
The submodules available in GraSPy are summarized in Figure 1.1. The
library contains functionality for fitting and sampling from random graph
models, performing dimensionality reduction on graphs or populations of
graphs (embedding), testing hypotheses on graphs, and plotting graphs
and embeddings.
The following sections provide a brief overview of the different submodules
of GraSPy; a more detailed overview with code usage can be found in the
tutorial section of the GraSPy documentation at https://graspy.neurodata.io/tutorial.
1.2.1 Simulations (Figure 1.1a)
Three classes of random graph models are implemented in GraSPy: 1) Erdos-
Rényi (ER) model, 2) stochastic block model (SBM), and 3) random dot product
graph (RDPG) model. The ER model is the simplest: it is parameterized by
the number of vertices, n, and either p, which specifies the probability of an
edge existing between a pair of vertices, or m, which specifies
the exact number of edges. All nodes have the same probability of connection
to each other under the ER model. Unlike ER models, the SBM produces
graphs containing communities, where vertices in each community share
common probabilities of connection to every other community. The SBM is
parameterized by the number of communities, K, a vector of probabilities of a
node belonging to each community, τ, and a probability matrix, B ∈ [0, 1]K×K,
that specifies the probability of edges within and between communities. An
extension of the SBM, the Degree-corrected SBM (DCSBM) has an added
parameter associated with each node that denotes its promiscuity in the graph,
which is its relative degree among the other nodes in its community. Nodes
still share the same relative probabilities of connection to each community,
but the nodes within a community may have heterogeneous expected degrees.
Finally, the RDPG model assumes that each vertex in the graph is associated
with a latent vector in Rd. The probability of an edge existing between pairs
of vertices is determined by the dot product of the associated latent position
vectors (Young and Scheinerman, 2007). The RDPG is parameterized by an
n by d matrix of these latent positions. GraSPy provides implementations
for sampling from each of these graph models given these parameters, as
well as estimating the parameters of a model from a given graph. GraSPy also
allows for weighting functions and directed graphs when sampling from these
models.
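To make these parameterizations concrete, the three models can be sampled in a few lines of numpy. This is an illustrative sketch, not GraSPy's own implementation (the library's simulations submodule provides its samplers, plus weighting functions and directed variants); it assumes unweighted, undirected graphs with no self-loops.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_er(n, p):
    """ER: every possible edge appears independently with probability p."""
    A = np.triu((rng.random((n, n)) < p).astype(float), 1)
    return A + A.T  # symmetrize; diagonal stays zero (no self-loops)

def sample_sbm(block_sizes, B):
    """SBM: edge probability depends only on the communities of the endpoints."""
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    P = B[labels][:, labels]  # n x n matrix of edge probabilities
    A = np.triu((rng.random(P.shape) < P).astype(float), 1)
    return A + A.T

def sample_rdpg(X):
    """RDPG: edge probability is the dot product of the endpoints' latent positions."""
    P = X @ X.T
    A = np.triu((rng.random(P.shape) < P).astype(float), 1)
    return A + A.T

A_er = sample_er(100, 0.3)
A_sbm = sample_sbm([50, 50], np.array([[0.5, 0.1], [0.1, 0.5]]))
A_rdpg = sample_rdpg(np.full((100, 2), 0.5))  # constant latent positions reduce to ER(0.5)
```

Note that the RDPG subsumes the other two models: an ER graph is an RDPG whose vertices share a single latent position, and an SBM is an RDPG with one latent position per community.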
1.2.2 Preprocessing (Figure 1.1b)
Various utility functions help the user input real data into GraSPy or check
simple attributes about a graph. Some examples include finding the largest
connected component of a graph, finding the intersection or union of con-
nected components across multiple graphs, transforming the weights of a
graph, or checking whether a graph is directed. These functions speed up the
user's workflow when working with real data that may be messy or noisy
and require preprocessing.
1.2.3 Embedding (Figure 1.1c)
Inference on random graphs depends on low-dimensional Euclidean
representations of the vertices of graphs, known as latent positions, typically given by
spectral decompositions of adjacency or Laplacian matrices (Levin et al., 2017).
Adjacency spectral embedding (ASE) and Laplacian spectral embedding (LSE)
are methods for embedding a single graph, and omnibus embedding allows
for embedding multiple graphs into the same dimensions such that the em-
beddings can be meaningfully compared. In addition, GraSPy allows for the
number of embedding dimensions to be automatically chosen by the algorithm
of Zhu and Ghodsi, 2006.
1.2.4 Hypothesis Testing (Figure 1.1d)
Given two graphs, a natural question to ask is whether these graphs are both
random samples from the same generative distribution. GraSPy provides two
types of test for this null hypothesis: semiparametric and nonparametric. Both
Figure 1.2: Connectome model fitting and complexity. Larval Drosophila left mushroom body adjacency matrix (unweighted, directed), followed by random samples from four different statistical models of connectomes fit using GraSPy: random dot product graph (RDPG), degree-corrected stochastic block model (DCSBM), stochastic block model (SBM), and Erdős-Rényi (ER). The bottom left shows the number of parameters for each, as compared to the 40,000+ parameters (possible edges) of the inhomogeneous Erdős-Rényi (IER) model, in which all potential edges are specified. Blocks are sorted by size (number of member vertices) and nodes are sorted by degree within each block. The block labels correspond to K) Kenyon cells, P) projection neurons, O) mushroom body output neurons, I) mushroom body input neurons (Eichler et al., 2017).
tests are framed under the RDPG model, where the generative distribution
can be modeled as a set of latent positions. The semiparametric test can only
be performed on two graphs of the same size and with known correspondence
between the vertices of the two graphs (Tang et al., 2017). Nonparametric
testing can be performed on graphs without vertex alignment, or even with
different numbers of vertices (Tang et al., 2014). Both tests provide a
statistically principled way of assessing whether two observed graphs are
drawn from the same distribution; for example, one can test whether the
brain connectivity graphs of siblings or twins came from the same generative
distribution (Chung et al., in preparation).
1.2.5 Clustering (Figure 1.1e)
GraSPy uses Gaussian mixture models (GMM) and k-means to compute the
grouping structure of vertices after embedding. The number of clusters to
fit for GMM is chosen by Bayesian information criterion (BIC), which is a
penalized likelihood function to evaluate the quality of estimators. Similarly,
the silhouette score is used to choose the number of clusters for k-means. Both
functions sweep over a range of parameters and use the above metrics to
choose clustering parameters in an unsupervised manner.
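As an illustration of the BIC sweep described above, the snippet below fits Gaussian mixtures with an increasing number of components to a hypothetical two-cluster embedding and keeps the model with the lowest BIC; the synthetic data are a stand-in for ASE output, and this sketch uses scikit-learn directly rather than GraSPy's own clustering API.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Hypothetical stand-in for an ASE output: two well-separated clusters in R^2.
X = np.vstack([rng.normal(0.0, 0.2, (100, 2)),
               rng.normal(3.0, 0.2, (100, 2))])

# Fit GMMs for a range of component counts; lower BIC is better.
models = [GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(X))
print(best.n_components)  # 2 for this well-separated toy embedding
```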
1.2.6 Plotting (Figure 1.1f)
GraSPy extends seaborn to visualize graphs as adjacency matrices and embed-
ded graphs as paired scatter plots (Waskom et al., 2018). Individual graphs
can be visualized using the heatmap function, and multiple graphs can be
overlaid on top of each other using the gridplot function. Both adjacency
matrix visualizations can be sorted by various node metadata. The pairplot function can visualize high
dimensional data, such as graphs in the embedded space, as a pairwise scatter
plot.
1.3 Conclusion
GraSPy is the first open-source Python package to perform robust statisti-
cal analysis on graphs and graph populations. Its compliance with the
scikit-learn API makes it an easy-to-use tool for anyone familiar with
machine learning in Python. In addition, GraSPy is implemented with an
extensible class structure, making it easy to modify and add new algorithms
to the package. As GraSPy continues to grow and add functionality, we believe
it will accelerate statistically-valid discovery in any field of study concerned
with populations of graphs.
References
Athreya, Avanti, Donniell E. Fishkind, Minh Tang, Carey E. Priebe, Youngser Park, Joshua T. Vogelstein, Keith Levin, Vince Lyzinski, Yichen Qin, and Daniel L Sussman (2018). “Statistical Inference on Random Dot Product Graphs: a Survey”. In: Journal of Machine Learning Research 18.226, pp. 1–92. URL: http://jmlr.org/papers/v18/17-448.html.
Young, Stephen J and Edward R Scheinerman (2007). “Random dot product graph models for social networks”. In: International Workshop on Algorithms and Models for the Web-Graph. Springer, pp. 138–149.
Levin, Keith, Avanti Athreya, Minh Tang, Vince Lyzinski, and Carey E Priebe (2017). “A central limit theorem for an omnibus embedding of multiple random dot product graphs”. In: pp. 964–967.
Zhu, Mu and Ali Ghodsi (2006). “Automatic dimensionality selection from the scree plot via the use of profile likelihood”. In: Computational Statistics & Data Analysis 51.2, pp. 918–930.
Eichler, Katharina, Feng Li, Ashok Litwin-Kumar, Youngser Park, Ingrid Andrade, Casey M Schneider-Mizell, Timo Saumweber, Annina Huser, Claire Eschbach, Bertram Gerber, et al. (2017). “The complete connectome of a learning and memory centre in an insect brain”. In: Nature 548.7666, p. 175.
Tang, Minh, Avanti Athreya, Daniel L Sussman, Vince Lyzinski, Youngser Park, and Carey E Priebe (2017). “A semiparametric two-sample hypothesis testing problem for random graphs”. In: Journal of Computational and Graphical Statistics 26.2, pp. 344–354.
Tang, Minh, Avanti Athreya, Daniel L. Sussman, Vince Lyzinski, and Carey E. Priebe (2014). “A nonparametric two-sample hypothesis testing problem for random dot product graphs”. In: Journal of Computational and Graphical Statistics, arXiv:1409.2344.
Waskom, Michael, Olga Botvinnik, Drew O'Kane, Paul Hobson, Joel Ostblom, Saulius Lukauskas, David C Gemperline, Tom Augspurger, Yaroslav Halchenko, John B. Cole, Jordi Warmenhoven, Julian de Ruiter, Cameron Pye, Stephan Hoyer, Jake Vanderplas, Santi Villalba, Gero Kunter, Eric Quintero, Pete Bachant, Marcel Martin, Kyle Meyer, Alistair Miles, Yoav Ram, Thomas Brunner, Tal Yarkoni, Mike Lee Williams, Constantine Evans, Clark Fitzgerald, Brian, and Adel Qalieh (2018). mwaskom/seaborn: v0.9.0 (July 2018). DOI: 10.5281/zenodo.1313201. URL: https://doi.org/10.5281/zenodo.1313201.
Chapter 2
Structural Connectomes areHeritable
2.1 Introduction
Understanding the extent to which genes and environment determine human
brain connectivity and structure, or heritability, is of great interest for
improving our understanding of brain function and diseases. To study such
properties, brains are often modelled as connectomes, or brain graphs, by
defining regions of the brain as nodes and the strength of connections between
regions as edges (“Connectal Coding: Discovering the Structures Linking
Cognitive Phenotypes to Individual Histories”). Numerous pipelines have
been developed and applied to diffusion magnetic resonance imaging (dMRI)
to reconstruct white matter fiber-tract trajectories in vivo, which are then used
to derive connectomes (Kiar et al., 2018; Maier-Hein et al., 2017). Graph theo-
retic methods have been applied to such connectomes to examine anatomical
connectivity in healthy subjects, schizophrenia patients, and identical twins
(Bullmore and Sporns, 2009; Bohlken et al., 2014; Heuvel et al., 2010; Micheloy-
annis, 2012; Bassett and Bullmore, 2006). This study extends current literature
by applying recent advances in statistics on random graph models to study
the heritability of human structural connectomes (Athreya et al., 2018; Tang
et al., 2017).
Prior work on structural connectomes utilizes graph features such as clustering
coefficients, small-worldness, average path length, betweenness centrality,
modularity, and motifs (Bullmore and Sporns, 2009; Bohlken et al., 2014;
Heuvel et al., 2010; Micheloyannis, 2012; Bassett and Bullmore, 2006). Some of
these features have simple intuitions, such as small-world topology, which
describes the tendency of brain networks to form local clusters with dense connections
within a cluster but sparse connections between clusters (Bassett and Bullmore,
2006; Bullmore and Sporns, 2009). However, differences in such features are
difficult to interpret, since many distinct connectomes can generate the same
feature value (“Connectal Coding: Discovering the Structures Linking
Cognitive Phenotypes to Individual Histories”). This lack
of interpretability of the results also makes translating the results into practice
and new research directions difficult.
In this work, we present a statistically principled procedure for studying
a population of connectomes and examine the heritability of structural
connectomes. Under a random graph model called the random dot product
graph (RDPG), each region of interest (ROI) is represented as a node with an
associated set of latent variables that can capture the genetic and environmental
influences on its connectivity to other ROIs (Athreya et al., 2018). The latent
variables for connectomes are estimated via spectral decomposition, and
Euclidean distances between latent variables are computed for all monozygotic,
dizygotic, sibling, and unrelated pairs to obtain distributions of distances
(Tang et al., 2017). The differences in distributions are then validated via
two-sample Kolmogorov-Smirnov tests. We show that human structural
connectomes are highly heritable and that differences in brain structure are
associated with differences in the genome.
2.2 Results
Figure 2.1 presents the kernel density estimates (KDEs) from the three distance
measures for all pairs of monozygotic twins, dizygotic twins, siblings and
unrelated individuals. The three distance measures, denoted exact, global
scale, and vertex-wise scale, represent three different models of heritability.
Under all three models of heritability, there is a stochastic ordering of the
distances for monozygotic, dizygotic, sibling, and unrelated pairs, from
smallest to largest. This ordering suggests that the similarity between two
structural connectomes is highly related to the genetic similarity between the
individuals. However, the ordering of dizygotic twins before siblings suggests
additional environmental factors or age effects in structural connectivity patterns.
To formally validate the ordering, the two-sample Kolmogorov-Smirnov
(KS) test was employed. Alternative hypotheses were formed for all six
possible pairs of distributions, as explained in Section 2.5.2. Table 2.1 presents
the results of the KS tests. Significance levels are marked with * (p < .05), ** (p <
.01), and *** (p < .001).
Figure 2.1: Kernel density estimates (KDE) of pairwise distances of connectomes from three different measures: exact, global scale, and vertex-wise scale, which represent different models of heritability. Each column corresponds to a distance measure, and each row corresponds to a familial relationship. The white vertical line within each KDE represents the mean. Under all three models, the distances for monozygotic twins were stochastically smaller than those of dizygotic twins, siblings, and unrelated pairs, while distances for unrelated pairs were stochastically larger than all others. Distances for dizygotic twins were stochastically smaller than those of siblings and unrelated pairs, but stochastically larger than those of monozygotic twins.
The null hypothesis is rejected for all six alternative hypotheses under the
three models of heritability. Thus, the KS tests under the different alternative
hypotheses validate the stochastic ordering of familial relationships suggested
in Figure 2.1. In summary, higher genetic similarity leads to more similar
connectomes, and the differences between groups are statistically significant.
2.3 Data Acquisition
2.3.1 Participants
We used publicly available diffusion MRI (dMRI) and structural MRI (sMRI)
data from the S1200 (2017) release of the Human Connectome Project (HCP)
Table 2.1: P-values given by the Kolmogorov-Smirnov test on the distributions of distances shown in Figure 2.1. The null hypothesis is that the two distributions of test statistics are the same, and the alternative hypothesis (H1) is that one distribution is stochastically larger than the other. The null hypothesis is rejected at significance level α = 0.05 for all alternative hypotheses under all three models of heritability.
Alternative Hypothesis        Exact            Global Scale     Vertex-wise Scale
Monozygotic < Dizygotic       1.78E-08 ***     1.78E-08 ***     2.23E-04 ***
Monozygotic < Sibling         1.46E-16 ***     6.26E-17 ***     2.33E-12 ***
Monozygotic < Unrelated       9.28E-34 ***     1.33E-32 ***     8.39E-20 ***
Dizygotic < Sibling           1.23E-02 *       1.29E-02 *       8.23E-03 **
Dizygotic < Unrelated         2.14E-10 ***     3.84E-10 ***     1.42E-08 ***
Sibling < Unrelated           9.77E-19 ***     5.65E-19 ***     4.16E-09 ***
Young Adult study, acquired by the Washington University in St. Louis
(WUSTL) and the University of Minnesota (Minn) (Van Essen et al., 2013; Van
Essen et al., 2012). Out of the 1206 participants released, 985 had viable dMRI
for processing. Demographics of the 985 participants are described in Table
2.2. All data collection procedures were approved by the institutional review
boards at WUSTL and Minn.
Table 2.2: Participant demographics of the HCP1200 dataset.

Zygosity      Monozygotic    Dizygotic      Non-twin siblings
N             250            259            476
Sex           167 F, 83 M    140 F, 119 M   223 F, 248 M
Age (mean)    29.6 (3.3)     28.9 (3.4)     28.3 (3.9)
Age (range)   22-36          22-36          22-37
2.3.2 Diffusion MRI Acquisition & Processing
Using dMRI and sMRI, graphs, or connectomes, were estimated using the
ndmg (Kiar et al., 2018) pipeline. The dMRI scans were pre-processed for
eddy currents using FSL’s eddy-correct (Smith, 2004). FSL’s "standard" linear
registration pipeline was used to register the sMRI and dMRI images to
the MNI152 atlas (Smith, 2004; Woolrich, 2009; Jenkinson, 2012; Mazziotta,
2001). A tensor model was fit using DiPy (Garyfallidis et al., 2014) to obtain
an estimated tensor at each voxel. A deterministic tractography algorithm was
applied using DiPy's EuDX (Garyfallidis et al., 2014; Garyfallidis et al., 2012)
to obtain streamlines, which indicate the voxels connected by an axonal fiber
tract. Graphs were formed by contracting voxels into graph vertices depending
on spatial similarity (Mhembere et al., 2013). In this study, a modified version
of Desikan-Killiany-Tourville (DKT) parcellation (Klein and Tourville, 2012)
was used to define the ROIs. Given a parcellation with vertices V and a
corresponding mapping P(vi) indicating the voxels within region i, we
contract our fiber streamlines as follows:

    w(vi, vj) = ∑u∈P(vi) ∑w∈P(vj) I{Fu,w}

where Fu,w is true if a fiber tract exists between voxels u and w, and false
otherwise.
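The contraction rule can be checked on a toy example; the parcellation and fiber set below are hypothetical stand-ins for the DKT ROIs and the EuDX streamlines.

```python
# Hypothetical parcellation: ROI -> set of member voxels.
parcellation = {0: {10, 11}, 1: {20, 21, 22}}
# Voxel pairs joined by a streamline, i.e. the pairs for which F_{u,w} is true.
fibers = {(10, 20), (11, 20), (11, 22)}

def edge_weight(roi_i, roi_j, parcellation, fibers):
    """w(vi, vj): count voxel pairs (u, w), u in P(vi), w in P(vj), joined by a tract."""
    return sum(1
               for u in parcellation[roi_i]
               for w in parcellation[roi_j]
               if (u, w) in fibers or (w, u) in fibers)

print(edge_weight(0, 1, parcellation, fibers))  # 3
```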
2.4 Preliminaries
2.4.1 Graph
A graph, or network, G, is defined as an ordered set of vertices and edges
(V, E) where V is the vertex set, and E, the set of edges, is a subset of the
Cartesian product of V × V. A vertex set is represented as V = {1, 2, . . . , n}
where |V| = n, and an edge exists between i and j if (i, j) ∈ E. A network can
also be represented by its adjacency matrix A ∈ Rn×n where A(i,j) represents
the value of the edge between i and j. Connectomes estimated from dMRIs are
graphs in which nodes are regions of interest (ROI) and edges are the number
of fiber tracts between a pair of ROIs.
2.4.2 Random Dot Product Graphs
Motivated by recent statistical results with strong theoretical guarantees from
(Athreya et al., 2018; Sussman et al., 2012), we choose the random dot product
graph (RDPG) as our model for connectomes (Young and Scheinerman, 2007).
In this model, each vertex is a region of the brain, and connections between
a pair of regions of the brain are dictated by unobserved latent positions, which
serve as our estimate of effects of genome and environment on connectomes.
Formally, a random dot product graph is defined as follows:
Definition 1 (Random dot product graph (RDPG)). Let F be a distribution on Rd
satisfying the inner product condition: for any two elements u, v in the support of F,
uTv ∈ [0, 1]. Let X1, X2, . . . , Xn ∼ F i.i.d., and let X = [X1, X2, . . . , Xn]T ∈ Rn×d
be the matrix whose rows are these latent positions. Suppose A is a random
adjacency matrix given by

    P[A | X] = ∏i<j (XiT Xj)^Aij (1 − XiT Xj)^(1−Aij)    (2.1)

We then write (A, X) ∼ RDPG(X), and say that A is the adjacency matrix of a
random dot product graph with latent position matrix X of rank at most d.
We further define the matrix P = (pij) of edge probabilities by P = XXT.
We also write A ∼ Bernoulli(P) to denote that the existence of an edge
between any two vertices i and j, where i > j, is a Bernoulli random variable
with probability pij, with edges independent of one another. We emphasize
that we only consider undirected graphs with no self-loops and non-negative
edge weights.
We note that this model has an inherent non-identifiability: given a latent
position matrix X ∈ Rn×d and an orthogonal matrix W ∈ Rd×d, X and XW
give rise to the same edge probabilities, since P = XXT = (XW)(XW)T.
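This non-identifiability is easy to verify numerically; the latent positions below are hypothetical, drawn so that every dot product lands in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0.2, 0.6, (5, 2))   # dot products lie in [0.08, 0.72], inside [0, 1]

# Any orthogonal W (here a planar rotation) leaves P = X X^T unchanged.
theta = 0.7
W = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(X @ X.T, (X @ W) @ (X @ W).T)
```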
2.4.3 Adjacency Spectral Embedding
Given an adjacency matrix A ∼ RDPG(X), a natural task is to recover the
latent positions X that gave rise to A. Adjacency spectral embedding provides
consistent estimates of the latent positions (Sussman et al., 2012), and is
defined as follows.

Definition 2. Let A = U S UT be the singular value decomposition (SVD) of A. The
adjacency spectral embedding (ASE) of A into Rd is X̂ = Ud Sd^(1/2), where Ud and
Sd retain the top d singular values and their associated singular vectors.
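A minimal numpy sketch of this definition; as a sanity check, it is applied to a rank-one probability matrix P rather than a noisy sampled A, so the latent positions are recovered exactly, up to sign.

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding: Xhat = U_d S_d^{1/2} from the top-d SVD of A."""
    U, S, _ = np.linalg.svd(A)
    return U[:, :d] * np.sqrt(S[:d])

# Rank-one check: P = x x^T with x = (0.5, ..., 0.5).
x = np.full(10, 0.5)
P = np.outer(x, x)
Xhat = ase(P, 1)
assert np.allclose(np.abs(Xhat.ravel()), 0.5)  # recovered up to sign
```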
2.4.4 Choosing the Embedding Dimension
We emphasize that the true dimensionality of latent positions is unknown
and we must choose the embedding dimension. A common methodology for
choosing the number of embedding dimensions in SVD is to visually examine
the scree plot and choose an elbow that separates the top signal dimensions
and noise dimensions. In this work, we consider the method of Zhu and
Ghodsi, 2006. Given the SVD A = U S UT, the singular values S are used to
choose the embedding dimension d via

    d̂ = argmax_d ProfileLikelihoodS(d)    (2.2)

where ProfileLikelihoodS(d) quantifies the magnitude of the gap after the first
d singular values.
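A sketch of the criterion under one common reading of Zhu and Ghodsi, 2006: for each candidate d, the sorted singular values are split into two groups modeled as Gaussians with a pooled variance, and the split maximizing the likelihood is chosen. GraSPy's automatic selection follows the same paper; the implementation below is an independent illustration.

```python
import numpy as np

def profile_likelihood_elbow(s):
    """Pick the split d of the sorted singular values maximizing a two-group
    Gaussian likelihood with a shared (pooled) variance."""
    s = np.sort(s)[::-1]
    n = len(s)
    lik = []
    for d in range(1, n):
        mu1, mu2 = s[:d].mean(), s[d:].mean()
        ss = np.sum((s[:d] - mu1) ** 2) + np.sum((s[d:] - mu2) ** 2)
        var = max(ss / n, 1e-12)  # guard against a degenerate zero-variance split
        lik.append(-0.5 * n * np.log(2 * np.pi * var) - ss / (2 * var))
    return int(np.argmax(lik)) + 1

# Three large singular values followed by small noise ones -> elbow at d = 3.
s = np.array([10.0, 9.5, 9.0, 0.6, 0.5, 0.5, 0.4, 0.4])
print(profile_likelihood_elbow(s))  # 3
```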
2.4.5 Pass-To-Ranks
Since the connectomes are weighted, we describe the pass-to-ranks method
for undirected graphs, which normalizes the edge weights such that Aij ∈
[0, 1] ∀ i, j ∈ {1, 2, . . . , n}.
Definition 3. Given A ∈ Rn×n, let R(Aij) be the “rank” of Aij, that is, R(Aij) = k
if Aij is the kth smallest nonzero entry above the main diagonal of A, with ties
broken by averaging the ranks. The pass-to-ranks (PTR) matrix is defined, for all
i < j, as

    Aij = R(Aij)/e   if Aij > 0,   and   0 otherwise,    (2.3)

where e = |E| is the number of edges.
Since we only consider undirected graphs, Aij = Aji ∀ i, j ∈ {1, 2, . . . , n}.
We apply pass-to-ranks to all graphs prior to ASE because this representation
of connectomes has been shown to be more reliable than raw edge weights
(Kiar et al., 2018).
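Definition 3 translates directly into code; the sketch below uses scipy's rankdata for tie-averaged ranks and assumes a non-negative, undirected weighted adjacency matrix.

```python
import numpy as np
from scipy.stats import rankdata

def pass_to_ranks(A):
    """Replace each nonzero edge weight above the diagonal with rank/|E|, then symmetrize."""
    iu = np.triu_indices_from(A, k=1)
    w = A[iu]
    nz = w > 0
    e = nz.sum()                      # e = |E|, the number of edges
    vals = np.zeros_like(w)
    vals[nz] = rankdata(w[nz]) / e    # tie-averaged ranks scaled into (0, 1]
    out = np.zeros_like(A)
    out[iu] = vals
    return out + out.T

A = np.array([[0., 5., 2.],
              [5., 0., 9.],
              [2., 9., 0.]])
P = pass_to_ranks(A)   # upper-triangle weights 5, 2, 9 become 2/3, 1/3, 1
```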
2.5 Estimation of Heritability
We compared pairs of individuals from the same family, defined as having a
shared mother, father, or both. Within a family, a pair of individuals can be
monozygotic twins, dizygotic twins, or non-twin siblings. Furthermore, we
examined random pairs of unrelated individuals, defined as sharing neither a
mother nor a father, to serve as a control group. To control for the potential
effect of age differences when comparing non-twin siblings, we sampled
unrelated pairs such that the distribution of age differences is the same for
both non-twin siblings and twins.
2.5.1 Preprocessing Data
Connectomes derived from dMRI can vary in their number of edges due to
false positive edges. To mitigate the potential effects of differing numbers of
edges when estimating heritability, the smallest-valued edges were removed
from each graph so that all graphs have the same number of edges, namely
the minimum number of edges across all available graphs. Given m graphs
{A(i)}, i = 1, . . . , m, let ei be the number of edges of graph A(i). The minimum
number of edges across all graphs is emin = min{e1, e2, . . . , em}. For each
graph A(i), the smallest-valued edges were thresholded so that each graph
has emin edges. After thresholding, each graph was passed-to-ranks prior
to ASE.
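The thresholding step can be sketched as follows; the helper name and toy matrix are hypothetical, and undirected weighted graphs are assumed.

```python
import numpy as np

def threshold_to_e(A, e_target):
    """Zero out the smallest-valued edges so that exactly e_target edges remain."""
    iu = np.triu_indices_from(A, k=1)
    w = A[iu].copy()
    nz = np.flatnonzero(w)
    drop = nz[np.argsort(w[nz])][: len(nz) - e_target]   # smallest edges first
    w[drop] = 0.0
    out = np.zeros_like(A)
    out[iu] = w
    return out + out.T

A = np.array([[0., 1., 3., 0.],
              [1., 0., 2., 5.],
              [3., 2., 0., 4.],
              [0., 5., 4., 0.]])
B = threshold_to_e(A, 3)   # keeps the three largest edges: 5, 4, and 3
```

In the thesis pipeline, e_target would be emin, the minimum edge count across all graphs.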
2.5.2 Models of Heritability and Distance Measures
Given a pair of graphs (A1 ∼ RDPG(X), A2 ∼ RDPG(Y)) on the same vertex
set with known correspondence (that is, there is a bijective map ϕ giving a
one-to-one correspondence of vertices between the two graphs), Tang et al.,
2017 provide three different measures of equality, or distance, between X and
Y. We consider the following three cases:

    Exact:              H0 : X = Y W     vs   H1 : X ≠ Y W       (2.4)
    Global Scale:       H0 : X = c Y W   vs   H1 : X ≠ c Y W     (2.5)
    Vertex-wise Scale:  H0 : X = D Y W   vs   H1 : X ≠ D Y W     (2.6)

where W ∈ Rd×d is an orthogonal matrix, c ∈ R is a scalar, and D ∈ Rn×n is a
diagonal matrix. Let X̂, Ŷ ∈ Rn×d be the estimates of latent positions obtained
from the ASE of A1 and A2, respectively. The distances are given by:

    T_Exact(X̂, Ŷ)             = min over W of    ∥X̂ − Ŷ W∥F       (2.7)
    T_Global Scale(X̂, Ŷ)      = min over c, W of ∥X̂ − c Ŷ W∥F     (2.8)
    T_Vertex-wise Scale(X̂, Ŷ) = min over D, W of ∥X̂ − D Ŷ W∥F     (2.9)
These three distance measures correspond to the three models of heritability.
Intuitively, each distance tells us how close two graphs are to each other up
to a given transformation. We obtain a distribution of distances for
monozygotic twins, dizygotic twins, non-twin siblings, and unrelated pairs
under each heritability model.
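The exact-model statistic is an orthogonal Procrustes problem with a closed-form solution: if Ŷ⊤X̂ = U S V⊤ is an SVD, the minimizing W is U V⊤. A numpy sketch with hypothetical embeddings:

```python
import numpy as np

def procrustes_distance(X, Y):
    """T_exact: min over orthogonal W of ||X - Y W||_F, solved via SVD of Y^T X."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return np.linalg.norm(X - Y @ (U @ Vt))

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))          # hypothetical latent position estimates
theta = 1.0
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# A rotated copy is distance ~0 under the exact model; an unrelated matrix is not.
assert procrustes_distance(X, X @ R) < 1e-8
assert procrustes_distance(X, rng.normal(size=(20, 2))) > 0.1
```

The global-scale and vertex-wise variants additionally minimize over the scalar c or the diagonal matrix D.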
2.5.3 Kolmogorov-Smirnov Two-Sample Test
Once the distance distributions are obtained for monozygotic, dizygotic,
sibling, and unrelated pairs, we employ the Kolmogorov-Smirnov (KS)
two-sample test to examine whether one distribution is statistically different
from another. The null hypothesis states that the two samples are drawn
from the same underlying distribution. The test statistic, D, is given by

    D = sup_x |F1(x) − F2(x)|    (2.10)

where F1 and F2 are the two empirical distribution functions. We formulate
the following six alternative hypotheses and provide the intuition for each
choice of alternative.
1. Monozygotic < Dizygotic - Since monozygotic twins have identical
genetics, their brain structures should be more similar than those of
dizygotic twins.
2. Monozygotic < Sibling - Similarly, monozygotic twins should have
more similar brain structures than those of non-twin siblings.
3. Monozygotic < Unrelated - Since unrelated pairs should have no genetic
similarity, the brain structures of unrelated pairs should be more dissimilar
than those of monozygotic twins.
4. Dizygotic < Sibling - Dizygotic twins and siblings should have similar
genetic variability within a family. However, there may be additional
effects from differences in age or environment that account for larger
differences between siblings.
5. Dizygotic < Unrelated - Same as 4.
6. Sibling < Unrelated - Same as 4.
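The KS statistic can be computed directly from the two empirical distribution functions, since the supremum is attained at a sample point. A minimal numpy sketch (in practice, `scipy.stats.ks_2samp` also provides p-values and supports the one-sided alternatives listed above via its `alternative` argument):

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample KS statistic D = sup_x |F1(x) - F2(x)|."""
    x, y = np.sort(x), np.sort(y)
    # The sup over all real x is attained at one of the sample points.
    grid = np.concatenate([x, y])
    F1 = np.searchsorted(x, grid, side="right") / len(x)
    F2 = np.searchsorted(y, grid, side="right") / len(y)
    return np.max(np.abs(F1 - F2))
```

Applied to our setting, x and y would be the distance samples for two pairing types, e.g. monozygotic versus dizygotic pairs.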
References
Vogelstein, Joshua, Eric Bridgeford, Benjamin Pedigo, Jaewon Chung, Keith Levin, Brett Mensh, and Carey Priebe. “Connectal Coding: Discovering the Structures Linking Cognitive Phenotypes to Individual Histories”. In: Current Opinion in Neurobiology.
Kiar, Gregory, Eric Bridgeford, Will Gray Roncal, Vikram Chandrashekhar, Disa Mhembere, Sephira Ryman, Xi-Nian Zuo, Daniel S Marguiles, R Cameron Craddock, Carey E Priebe, Rex Jung, Vince Calhoun, Brian Caffo, Randal Burns, Michael P Milham, and Joshua Vogelstein (2018). “A High-Throughput Pipeline Identifies Robust Connectomes But Troublesome Variability”. In: bioRxiv. DOI: 10.1101/188706. eprint: https://www.biorxiv.org/content/early/2018/04/24/188706.full.pdf. URL: https://www.biorxiv.org/content/early/2018/04/24/188706.
Maier-Hein, Klaus H, Peter F Neher, Jean-Christophe Houde, Marc-Alexandre Côté, Eleftherios Garyfallidis, Jidan Zhong, Maxime Chamberland, Fang-Cheng Yeh, Ying-Chia Lin, Qing Ji, et al. (2017). “The challenge of mapping the human connectome based on diffusion tractography”. In: Nature Communications 8.1, p. 1349.
Bullmore, Ed and Olaf Sporns (2009). “Complex brain networks: graph theoretical analysis of structural and functional systems”. In: Nature Reviews Neuroscience 10.3, pp. 186–198. ISSN: 1471-0048. DOI: 10.1038/nrn2575.
Bohlken, Marc M., René C. W. Mandl, Rachel M. Brouwer, Martijn P. van den Heuvel, Anna M. Hedman, René S. Kahn, and Hilleke E. Hulshoff Pol (2014). “Heritability of structural brain network topology: A DTI study of 156 twins”. In: Human Brain Mapping 35.10, pp. 5295–5305. ISSN: 1097-0193. DOI: 10.1002/hbm.22550.
Heuvel, Martijn P. van den, René C. W. Mandl, Cornelis J. Stam, René S. Kahn, and Hilleke E. Hulshoff Pol (2010). “Aberrant Frontal and Temporal Complex Network Structure in Schizophrenia: A Graph Theoretical Analysis”. In: Journal of Neuroscience 30.47, pp. 15915–15926. ISSN: 0270-6474, 1529-2401. DOI: 10.1523/JNEUROSCI.2874-10.2010.
Micheloyannis, Sifis (2012). “Graph-based network analysis in schizophrenia”. In: World Journal of Psychiatry 2.1, pp. 1–12. ISSN: 2220-3206. DOI: 10.5498/wjp.v2.i1.1.
Bassett, Danielle Smith and Ed Bullmore (2006). “Small-World Brain Networks”. In: The Neuroscientist 12.6, pp. 512–523. ISSN: 1073-8584. DOI: 10.1177/1073858406293182.
Athreya, Avanti, Donniell E. Fishkind, Minh Tang, Carey E. Priebe, Youngser Park, Joshua T. Vogelstein, Keith Levin, Vince Lyzinski, Yichen Qin, and Daniel L Sussman (2018). “Statistical Inference on Random Dot Product Graphs: a Survey”. In: Journal of Machine Learning Research 18.226, pp. 1–92. URL: http://jmlr.org/papers/v18/17-448.html.
Tang, Minh, Avanti Athreya, Daniel L. Sussman, Vince Lyzinski, Youngser Park, and Carey E. Priebe (2017). “A Semiparametric Two-Sample Hypothesis Testing Problem for Random Graphs”. In: Journal of Computational and Graphical Statistics 26.2, pp. 344–354. DOI: 10.1080/10618600.2016.1193505. URL: https://doi.org/10.1080/10618600.2016.1193505.
Van Essen, David C, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub, Kamil Ugurbil, Wu-Minn HCP Consortium, et al. (2013). “The WU-Minn human connectome project: an overview”. In: Neuroimage 80, pp. 62–79.
Van Essen, David C, Kamil Ugurbil, E Auerbach, D Barch, TEJ Behrens, R Bucholz, Acer Chang, Liyong Chen, Maurizio Corbetta, Sandra W Curtiss, et al. (2012). “The Human Connectome Project: a data acquisition perspective”. In: Neuroimage 62.4, pp. 2222–2231.
Smith, Stephen M et al. (2004). “Advances in functional and structural MR image analysis and implementation as FSL.” In: NeuroImage 23 Suppl 1, S208–19. ISSN: 1053-8119. URL: http://www.ncbi.nlm.nih.gov/pubmed/15501092.
Woolrich, Mark W et al. (2009). “Bayesian analysis of neuroimaging data in FSL.” In: NeuroImage 45.1 Suppl, S173–86. ISSN: 1095-9572. URL: http://www.sciencedirect.com/science/article/pii/S1053811908012044.
Jenkinson, Mark et al. (2012). “FSL.” In: NeuroImage 62.2, pp. 782–90. ISSN: 1095-9572. URL: http://www.ncbi.nlm.nih.gov/pubmed/21979382.
Mazziotta, John et al. (2001). “A four-dimensional probabilistic atlas of the human brain”. In: Journal of the American Medical Informatics Association 8.5, pp. 401–430.
Garyfallidis, Eleftherios, Matthew Brett, Bagrat Amirbekian, Ariel Rokem, Stefan Van Der Walt, Maxime Descoteaux, and Ian Nimmo-Smith (2014). “Dipy, a library for the analysis of diffusion MRI data”. In: Frontiers in Neuroinformatics 8, p. 8.
Garyfallidis, Eleftherios, Matthew Brett, Marta Correia, Guy Williams, and Ian Nimmo-Smith (2012). “QuickBundles, a Method for Tractography Simplification”. In: Frontiers in Neuroscience 6, p. 175.
Mhembere, Disa, William Gray Roncal, Daniel Sussman, Carey E Priebe, Rex Jung, Sephira Ryman, R Jacob Vogelstein, Joshua T Vogelstein, and Randal Burns (2013). “Computing scalable multivariate glocal invariants of large (brain-) graphs”. In: Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE. IEEE, pp. 297–300.
Klein, Arno and Jason Tourville (2012). “101 Labeled Brain Images and a Consistent Human Cortical Labeling Protocol”. In: Frontiers in Neuroscience 6, p. 171. ISSN: 1662-453X. DOI: 10.3389/fnins.2012.00171. URL: https://www.frontiersin.org/article/10.3389/fnins.2012.00171.
Sussman, Daniel L, Minh Tang, Donniell E Fishkind, and Carey E Priebe (2012). “A consistent adjacency spectral embedding for stochastic blockmodel graphs”. In: Journal of the American Statistical Association 107.499, pp. 1119–1128.
Young, Stephen J. and Edward R. Scheinerman (2007). “Random Dot Product Graph Models for Social Networks”. In: Lecture Notes in Computer Science. Ed. by Anthony Bonato and Fan R. K. Chung, pp. 138–149.
Zhu, Mu and Ali Ghodsi (2006). “Automatic dimensionality selection from the scree plot via the use of profile likelihood”. In: Computational Statistics & Data Analysis 51.2, pp. 918–930.
Curriculum Vitae
Jaewon Chung was born on July 8th in Seoul, South Korea. He graduated from
Wesleyan University with a degree in neuroscience & behavior and economics.
He then worked at various hospitals in New York, first as a clinical researcher
and later as a data analyst at NYU Langone Hospital. In 2017, Jaewon began
his Master's degree in biomedical engineering at Johns Hopkins University,
where he studied topics such as fluorescence microscopy image segmentation
and the statistical analysis of populations of human brains. In 2019, Jaewon
will continue his studies at Johns Hopkins University as a PhD student in
biomedical engineering.