Statistical Analysis of Network Data


Transcript of Statistical Analysis of Network Data


Statistical Analysis

In this next unit we will look at methods that approach network analysis from a statistical inference perspective.

1 / 74

Statistical Analysis

In particular we will look at three statistical inference and learning tasks over networks:

Analyzing edges between vertices as a stochastic process over which we can make statistical inferences

Constructing networks from observational data

Analyzing a process (e.g., diffusion) over a network in a statistical manner

2 / 74

https://doi.org/10.1016/j.prevetmed.2014.01.013

Spatial effects in ecological networks

3 / 74

Exponential Random Graph Models

In the ER random graph model, edge probabilities were independent of vertex characteristics.

Now assume vertices have measured attributes.

Question: what is the effect of these attributes on network formation, specifically on edge occurrence?

4 / 74

Exponential Random Graph Models

Denote $Y$ as the adjacency matrix of graph $G$ over $n$ elements.

Denote $X$ as the matrix of vertex attributes.

We want to determine

$$P(Y_{ij} = 1 \mid Y_{-(ij)} = y_{-(ij)}, x_i, x_j, L(G))$$

where $L(G)$ is a measure of structure of graph $G$ and $y_{-(ij)}$ is the configuration of edges other than edge $i \sim j$.

5 / 74

Exponential Random Graph Models

We can motivate the ERGM model from regression (where outcome $Y$ is continuous):

$$E[Y_i \mid x_i] = \sum_{j=1}^{p} \beta_j x_{ij} = \beta' x_i$$

6 / 74

Exponential Random Graph Models

We turn this into a probabilistic model as

$$Y = \beta' x_i + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)$$

7 / 74

Exponential Random Graph Models

For binary outcome $Y$ we use logistic regression:

$$\log \frac{P(Y_i = 1 \mid x_i)}{1 - P(Y_i = 1 \mid x_i)} = \beta' x_i$$

This corresponds to a Bernoulli model of $P(Y_i = 1 \mid x_i)$.

8 / 74

Exponential Random Graph Models

The outcome of interest in the ERGM model is the presence of edge $i \sim j$, i.e., $y_{ij} = 1$.

Use a Bernoulli model with $y_{ij}$ as the outcome, with vertex attributes and graph structural measures as predictors.

9 / 74

Exponential Random Graph Models

Model 1: the ER model

$$\log \frac{P(Y_{ij} = 1 \mid Y_{-(ij)} = y_{-(ij)})}{P(Y_{ij} = 0 \mid Y_{-(ij)} = y_{-(ij)})} = \theta$$

Thinking of logistic regression: the model is a constant, independent of the rest of the graph structure and independent of vertex attributes.

10 / 74

Exponential Random Graph Models

To fit models we need a likelihood, i.e., the probability of the observed graph, given parameters (in this case $\theta$).

11 / 74

Exponential Random Graph Models

To fit models we need a likelihood, i.e., the probability of the observed graph, given parameters (in this case $\theta$).

Write $P(Y_{ij} = 1 \mid \ldots)$ as $p$; then the likelihood is given by

$$L(\theta; y) = \prod_{ij} p^{y_{ij}} (1 - p)^{(1 - y_{ij})}$$

12 / 74

Exponential Random Graph Models

(Exercise)

$$L(\theta; y) = \frac{1}{\kappa} \exp\{\theta L(y)\}$$

where $L(y)$ is the number of edges in the graph.

This is the formulation given in the reading!

13 / 74
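The Bernoulli likelihood above is easy to evaluate directly; a minimal sketch in R, assuming A is the binary adjacency matrix of an undirected graph:

loglik.er <- function(A, theta) {
  p <- plogis(theta)        # p = exp(theta) / (1 + exp(theta))
  y <- A[upper.tri(A)]      # one indicator per dyad
  sum(y * log(p) + (1 - y) * log(1 - p))  # equals theta * L(y) - log(kappa)
}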

## Observations: 36

## Variables: 9

## $ name <chr> "V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", …

## $ Seniority <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…

## $ Status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

## $ Gender <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

## $ Office <int> 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 2, …

## $ Years <int> 31, 32, 13, 31, 31, 29, 29, 28, 25, 25, 23, 24, 22, 1,…

## $ Age <int> 64, 62, 67, 59, 59, 55, 63, 53, 53, 53, 50, 52, 57, 56…

## $ Practice <int> 1, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, …

## $ School <int> 1, 1, 1, 3, 2, 1, 3, 3, 1, 3, 1, 2, 2, 1, 3, 1, 1, 2, …

Exponential Random Graph Models

14 / 74

library(sand)    # assumed source of the lazega lawyer network (an igraph object)
library(igraph)  # for get.adjacency()
library(ergm)

A <- get.adjacency(lazega)

lazega.s <- network::as.network(as.matrix(A), directed=FALSE)

ergm.bern.fit <- ergm(lazega.s ~ edges)

ergm.bern.fit

##

## MLE Coefficients:

## edges

## -1.499

Exponential Random Graph Models

So $\theta = -1.5$ and thus $p = 0.183$.

15 / 74
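A one-line check of this relationship in base R (plogis is the inverse logit):

plogis(-1.499)   # inverse logit of the fitted edges coefficient; about 0.183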

Exponential Random Graph Models

16 / 74

Exponential Random Graph Models

The ER model is not appropriate; let's extend with more graph statistics.

$S_k(y)$: number of $k$-stars

$T_k(y)$: number of $k$-triangles

17 / 74

Exponential Random Graph Models

In practice, instead of adding terms for structural statistics at all values of $k$, they are combined into a single term.

For example, alternating $k$-star counts:

$$\mathrm{AKS}_\lambda(y) = \sum_{k=2}^{N_v - 1} (-1)^k \frac{S_k(y)}{\lambda^{k-2}}$$

$\lambda$ is a parameter that controls the decay of influence of larger $k$ terms. Treat it as a hyperparameter of the model.

18 / 74

Exponential Random Graph Models

Another example is the geometrically weighted degree count:

$$\mathrm{GWD}_\gamma(y) = \sum_{d=0}^{N_v - 1} e^{-\gamma d} N_d(y)$$

There is a good amount of literature on definitions and properties of suitable terms to summarize graph structure in these models.

19 / 74

Exponential Random Graph Models

In addition we want to adjust edge probabilities based on vertex attributes.

For edge $i \sim j$, $i$ may have an attribute that increases degree (e.g., seniority).

Or, $i$ and $j$ have attributes that together increase edge probability (e.g., spatial distance in an ecological network).

20 / 74

Exponential Random Graph Models

We can add attribute terms to the ERGM model accordingly. E.g.,

Main effects: $h(x_i, x_j) = x_i + x_j$

Categorical interaction (match): $h(x_i, x_j) = I(x_i == x_j)$

Numeric interaction: $h(x_i, x_j) = (x_i - x_j)^2$

21 / 74
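Written as plain R functions, these attribute terms would look as follows (an illustrative sketch; ergm supplies them as model terms, so these helper names are hypothetical):

h.main    <- function(xi, xj) xi + xj               # main effect
h.match   <- function(xi, xj) as.numeric(xi == xj)  # categorical interaction
h.numeric <- function(xi, xj) (xi - xj)^2           # numeric interaction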

Exponential Random Graph Models

A full ERGM model for this data (in recent versions of the ergm package, the nodemain and match terms are written nodecov and nodematch):

lazega.ergm <- formula(lazega.s ~ edges + gwesp(log(3), fixed=TRUE) +
                         nodemain("Seniority") +
                         nodemain("Practice") +
                         match("Practice") +
                         match("Gender") +
                         match("Office"))

22 / 74
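Fitting and summarizing the model then follows the usual pattern (a sketch; ergm fits by MCMC, so results vary slightly from run to run):

lazega.fit <- ergm(lazega.ergm)   # fit by (MCMC) maximum likelihood
summary(lazega.fit)               # coefficient table like the one below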

Exponential Random Graph Models

## # A tibble: 1 x 5

## independence iterations logLik AIC BIC

## <lgl> <int> <dbl> <dbl> <dbl>

## 1 FALSE 2 -230. 474. 505.

23 / 74

Exponential Random Graph Models

term              estimate   std.error   statistic    p.value
edges           -7.0065546   0.6711396  -10.439787  0.0000000
gwesp            0.5916556   0.0855375    6.916915  0.0000000
Seniority        0.0245612   0.0061996    3.961761  0.0000744
Practice         0.3945545   0.1021836    3.861229  0.0001128
match.Practice   0.7696627   0.1906006    4.038093  0.0000539
match.Gender     0.7376656   0.2436241    3.027885  0.0024627
match.Office     1.1643929   0.1875340    6.208970  0.0000000

(Term labels follow the order of the model formula above.)

24 / 74

Exponential Random Graph Models

25 / 74

Exponential Random Graph Models

A few more points:

The general formulation for ERGM is

$$P_\theta(Y = y) = \frac{1}{\kappa} \exp\left\{ \sum_H \theta_H g_H(y) \right\}$$

where $H$ represents possible configurations of possible edges among a subset of vertices in the graph, and

$$g_H(y) = \prod_{y_{ij} \in H} y_{ij}$$

so $g_H(y) = 1$ if configuration $H$ occurs in the graph.

26 / 74

Exponential Random Graph Models

This brings about some complications since it's infeasible to define functions over all possible configurations.

Instead, collapse configurations into groups based on certain properties, and count the number of times these properties are satisfied in the graph.

Even then, computing the normalization term $\kappa$ is also infeasible; therefore sampling methods (MCMC) are used for estimation.

27 / 74

https://doi.org/10.1371/journal.pbio.1002527

Stochastic Block Models

28 / 74

Stochastic Block Models

Method to cluster vertices in a graph.

Assume that each vertex belongs to one of $Q$ classes.

Then the probability of edge $i \sim j$ depends on the class/cluster membership of vertices $i$ and $j$.

29 / 74

Stochastic Block Models

Clustered ERGM model

If we knew vertex classes, e.g., $i$ belongs to class $q$ and $j$ belongs to class $r$:

$$\log \frac{P(Y_{ij} = 1 \mid Y_{-(ij)} = y_{-(ij)})}{P(Y_{ij} = 0 \mid Y_{-(ij)} = y_{-(ij)})} = \theta_{qr}$$

30 / 74

Stochastic Block Models

Clustered ERGM model

The likelihood is then

$$L(\theta; y) = \frac{1}{\kappa} \exp\left\{ \sum_{qr} \theta_{qr} L_{qr}(y) \right\}$$

with $L_{qr}(y)$ the number of edges $i \sim j$ where $i$ is in class $q$ and $j$ is in class $r$

(a model like g ~ match(class) in ERGM)

31 / 74

Stochastic Block Models

However, suppose we don't know vertex class assignments...

SBM is a probabilistic method where we maximize the likelihood of this model, assuming class assignments are unobserved.

32 / 74

Stochastic Block Models

33 / 74

Stochastic Block Models

$Y_{ij}$: edge $i \sim j$ (binary)

$z_{iq}$: indicator for vertex $i$ in class $q$ (binary)

$\alpha_q$: prior for class $q$, $p(z_{iq} = 1) = \alpha_q$

$\pi_{qr}$: probability of edge $i \sim j$ where $i$ is in class $q$ and $j$ is in class $r$

34 / 74

Stochastic Block Models

With this we can again write a (complete-data log) likelihood

$$L(\theta; y, z) = \sum_i \sum_q z_{iq} \log \alpha_q + \frac{1}{2} \sum_{i \neq j} \sum_{q,r} z_{iq} z_{jr} \log b(y_{ij}; \pi_{qr})$$

with $b(y; \pi) = \pi^y (1 - \pi)^{(1-y)}$

35 / 74

Stochastic Block Models

Like similar models (e.g., Gaussian mixture model, Latent Dirichlet Allocation), we can't optimize this directly.

Instead the EM algorithm is used:

Initialize parameters $\theta$

Repeat until "convergence":

  Compute $\gamma_{iq} = E\{z_{iq} \mid y; \theta\} = p(z_{iq} \mid y; \theta)$

  Maximize the likelihood w.r.t. $\theta$, plugging in $\gamma_{iq}$ for $z_{iq}$.

36 / 74

Stochastic Block Models

Like similar models, we need to determine the number of classes (clusters) and select it using some model selection criterion:

AIC

BIC

Integrated Classification Likelihood

37 / 74
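One way to fit an SBM and choose the number of classes in R is the blockmodels package; a sketch, assuming blockmodels is installed and A is the binary symmetric adjacency matrix from earlier (the API shown is the package's variational-EM interface as I understand it):

library(blockmodels)
sbm <- BM_bernoulli("SBM_sym", as.matrix(A))  # Bernoulli SBM, undirected graph
sbm$estimate()                                # variational EM over a range of Q
Q.best <- which.max(sbm$ICL)                  # choose Q by Integrated Classification Likelihood
z.hat <- sbm$memberships[[Q.best]]$Z          # posterior class membership probabilities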

Stochastic Block Models

38 / 74

Communities

If we think of a class as a community, we can see the relationship with non-probabilistic community finding methods (e.g., Newman-Girvan).

See http://www.pnas.org/content/106/50/21068.full for a comparison of these methods.

39 / 74

Summary

A slightly different way of thinking probabilistically about networks:

Define a probabilistic model over network configurations

Parameterize the model using network structural properties and vertex properties

Perform inference/analysis on the resulting parameters

Can also extend classical clustering methodology to this setting

40 / 74

Learning Network Structure

How to find network structure from observational data

Gene Co-expression

41 / 74

Learning Network Structure

How to find network structure from observational data

Functional Connectivity

42 / 74

Learning Network Structure

Correlation Networks

The simplest approach: compute the correlation between observations; if the correlation is high, add an edge.

43 / 74

Learning Network Structure

Assume data $y_i = \{y_{i1}, y_{i2}, \ldots, y_{iT}\}$ (e.g., expression of gene $i$ in $T$ different conditions) and $y_j$.

Important quantity 1: the covariance of $y_i$ and $y_j$:

$$\sigma_{ij} = \frac{1}{T} \sum_{t=1}^{T} (y_{it} - \bar{y}_i)(y_{jt} - \bar{y}_j)$$

44 / 74

Learning Network Structure

Important quantity 1: the covariance of $y_i$ and $y_j$:

$$\sigma_{ij} = \frac{1}{T} \sum_{t=1}^{T} (y_{it} - \bar{y}_i)(y_{jt} - \bar{y}_j)$$

How do $y_i$ and $y_j$ vary around their means?

45 / 74

Learning Network Structure

We can estimate $\sigma_{ij}$ from data by plugging in the means of $y_i$ and $y_j$.

We would notate the estimate as $\hat{\sigma}_{ij}$.

In the following, $\sigma_{ij}$ often means $\hat{\sigma}_{ij}$; it should follow from context.

46 / 74

Learning Network Structure

We often need to compare quantities across different entities in the system, e.g., genes, so we want to remove scale.

Pearson's correlation:

$$\rho_{ij} = \frac{\sigma_{ij}}{\sigma_{ii} \sigma_{jj}}$$

with $\sigma_{ii}$ the standard deviation of $y_i$:

$$\sigma_{ii} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} (y_{it} - \bar{y}_i)^2}$$

47 / 74

Learning Network Structure

Note Pearson correlation is between -1 and 1; it is hard to perform inference on bounded quantities, so one more transformation.

Fisher's transformation:

$$z_{ij} = \tanh^{-1}(\rho_{ij}) = \frac{1}{2} \log \frac{1 + \rho_{ij}}{1 - \rho_{ij}}$$

48 / 74

Learning Network Structure

Edge inference: hypothesis test

$$H_0: \rho_{ij} = 0 \qquad H_A: \rho_{ij} \neq 0$$

Compute P-value $p_{ij}$ from $N(0, \sqrt{1/(T-3)})$

49 / 74
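The last few slides combined in R (a minimal sketch on simulated data; yi and yj stand in for hypothetical expression vectors):

set.seed(1)
T <- 50                                   # number of conditions
yi <- rnorm(T)
yj <- 0.5 * yi + rnorm(T)                 # a correlated pair
z <- atanh(cor(yi, yj))                   # Fisher's transformation
p <- 2 * pnorm(-abs(z) / sqrt(1/(T - 3))) # two-sided P-value under H0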

Learning Network Structure

Perform a hypothesis test for every pair of entities, i.e., possible edge $i \sim j$.

We would compute a P-value for each possible edge.

When performing many independent tests, P-values no longer have our intended interpretation.

50 / 74

Learning Network Structure

Multiple Hypothesis Testing

                 Called Significant   Not Called Significant   Total
Null True                V                   m0 - V             m0
Altern. True             S                   m1 - S             m1
Total                    R                    m - R              m

Note: $m$ total tests

51 / 74

Learning Network Structure

Error rates

Family-wise error rate (FWER): the probability of at least one Type I error (false positive):

$$FWER = Pr(V \geq 1)$$

We use the Bonferroni procedure to control FWER.

If testing at level $\alpha$ (e.g., $\alpha = 0.05$), only include edges for which the P-value $p_{ij} \leq \alpha/m$.

52 / 74

Learning Network Structure

Error rates

False Discovery Rate (FDR): the rate at which false discoveries occur:

$$FDR = E(V/R; R > 0) Pr(R > 0)$$

We use the Benjamini-Hochberg procedure to control FDR.

Construct the list of edges at FDR level $\beta$ (e.g., $\beta = 0.1$) by including edges with $p_{(k)} \leq \beta \frac{k}{m}$, where $p_{(k)}$ is the $k$-th smallest P-value.

Note: there are other more precise FDR controlling procedures (esp. $q$-values)

53 / 74
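Both corrections are available in base R via p.adjust; a sketch, assuming pvals holds one P-value per candidate edge:

pvals <- runif(100)                                         # placeholder P-values
edges.fwer <- which(p.adjust(pvals, "bonferroni") <= 0.05)  # Bonferroni at level alpha
edges.fdr  <- which(p.adjust(pvals, "BH") <= 0.10)          # Benjamini-Hochberg at level beta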

Learning Network Structure

The problem with Pearson's correlation

Consider the following networks, where the absence of an edge corresponds to true conditional independence between vertices in the graph.

In all three of these, Pearson's correlation test for $\rho_{ij}$ is likely statistically significant.

54 / 74

Learning Network Structure

Let's extend the way we think about the situation. First consider the covariance matrix for $i$, $j$ and $k$:

$$\Sigma = \begin{pmatrix} \sigma^2_{ii} & \sigma_{ij} & \sigma_{ik} \\ \sigma_{ji} & \sigma^2_{jj} & \sigma_{jk} \\ \sigma_{ki} & \sigma_{kj} & \sigma^2_{kk} \end{pmatrix}$$

55 / 74

Learning Network Structure

We can then think about the covariance of $i$ and $j$ conditioned on $k$:

$$\Sigma_{ij|k} = \begin{pmatrix} \sigma^2_{ii} & \sigma_{ij} \\ \sigma_{ji} & \sigma^2_{jj} \end{pmatrix} - \sigma^{-2}_{kk} \begin{pmatrix} \sigma^2_{ik} & \sigma_{ik}\sigma_{jk} \\ \sigma_{ik}\sigma_{jk} & \sigma^2_{jk} \end{pmatrix}$$

How do $y_i$ and $y_j$ co-vary around their conditional means $E(y_i \mid y_k)$ and $E(y_j \mid y_k)$?

56 / 74

Learning Network Structure

Partial correlation networks

This leads to the concept of partial correlation (which we can derive from the conditional covariance):

$$\rho_{ij|k} = \frac{\rho_{ij} - \rho_{ik}\rho_{jk}}{\sqrt{(1 - \rho^2_{ik})} \sqrt{(1 - \rho^2_{jk})}}$$

57 / 74
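The same formula as a small R helper (a sketch; the inputs are the three pairwise Pearson correlations):

partial.cor <- function(r.ij, r.ik, r.jk) {
  (r.ij - r.ik * r.jk) / (sqrt(1 - r.ik^2) * sqrt(1 - r.jk^2))
}
partial.cor(0.6, 0.7, 0.8)   # correlation of i and j after removing the effect of k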

Learning Network Structure

Partial correlation networks

What's the test now? No edge if $i$ and $j$ are conditionally independent (there is some $k$ such that $\rho_{ij|k} = 0$).

Formally:

$$H_0: \rho_{ij|k} = 0 \text{ for some } k \in V \setminus \{i,j\}$$
$$H_A: \rho_{ij|k} \neq 0 \text{ for all } k$$

58 / 74

Learning Network Structure

Partial correlation networks

To determine edge $i \sim j$, compute P-value $p_{ij}$ as

$$p_{ij} = \max\{p_{ij|k} : k \in V \setminus \{i,j\}\}$$

where $p_{ij|k}$ is a P-value computed from the (transformed) partial correlation $\rho_{ij|k}$.

Use multiple testing correction as before.

59 / 74

Learning Network Structure

Problems with partial correlation networks

For every edge, must compute the partial correlation w.r.t. every other vertex.

Compound hypothesis tests like the above are harder to control for multiple testing (i.e., the correction mentioned above is not quite right).

The dependence structure they represent is unclear.

60 / 74

Learning Network Structure

Here we turn to a very powerful abstraction: thinking of graphs as a way of describing the joint distribution of gene expression measurements (Probabilistic Graphical Models).

61 / 74

Graphical Models

Consider each complete vector of expression measurements $y$ at each time.

Suppose some conditional independence properties hold for some variables in $y$.

Example: variables $y_2$ and $y_3$ are independent given the remaining variables in $y$.

62 / 74

Graphical Models

We can encode these conditional independence properties in a graph.

63 / 74

Graphical Models

Hammersley-Clifford theorem: all probability distributions that satisfy the conditional independence properties in a graph can be written as

$$P(y) = \frac{1}{Z} \exp\left\{ \sum_{c \in C} f_c(y_c) \right\}$$

$C$ is the set of all cliques in the graph, $c$ a specific clique, and $y_c$ the variables in the clique.

64 / 74

Graphical Models

The probability distribution is determined by the choice of potential functions $f_c$. Example:

1. $f_c(y_i) = -\frac{1}{2} \tau_{ii} y_i^2$

2. $f_c(\{y_i, y_j\}) = -\frac{1}{2} \tau_{ij} y_i y_j$

3. $f_c(y_c) = 0$ for $|y_c| \geq 3$

65 / 74

Graphical Gaussian Models

Define matrix $\Sigma^{-1}$ as

1. $\Sigma^{-1}_{ij} = \tau_{ij}$ if there is an edge between $y_i$ and $y_j$

2. $\Sigma^{-1}_{ij} = 0$ otherwise

$$\Sigma^{-1} = \begin{pmatrix} \tau^2_{11} & \tau_{12} & 0 & \tau_{14} \\ \tau_{12} & \tau^2_{22} & \tau_{23} & 0 \\ 0 & \tau_{23} & \tau^2_{33} & \tau_{34} \\ \tau_{14} & 0 & \tau_{34} & \tau^2_{44} \end{pmatrix}$$

66 / 74

Graphical Gaussian Models

With this in place, we can say that $y$ is distributed as the multivariate normal distribution $N(0, \Sigma)$.

Connection to partial correlation: we can think about the distribution of $y_i$ and $y_j$ conditioned on the rest of the graph $V \setminus \{i,j\}$, and the (partial) correlation of $y_i$ and $y_j$ under this distribution:

$$\rho_{ij|V \setminus \{i,j\}} = -\frac{\tau_{ij}}{\sqrt{\tau_{ii}\tau_{jj}}}$$

67 / 74
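This identity gives a direct recipe in R: invert the covariance matrix and rescale (a sketch on a small synthetic covariance matrix):

Sigma <- matrix(c(1.0, 0.5, 0.2,
                  0.5, 1.0, 0.3,
                  0.2, 0.3, 1.0), nrow = 3)
Omega <- solve(Sigma)                                    # inverse covariance (the tau's)
pcor <- -Omega / sqrt(outer(diag(Omega), diag(Omega)))   # rho_ij given the rest
diag(pcor) <- 1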

Banerjee et al., ICML 2006, JMLR 2008; Friedman et al., Biostatistics 2007

Sparse Inverse Covariance

With this framework in place we can now think of network structure inference.

Main idea: given draws from the multivariate distribution $y$ (i.e., an expression vector at each time point),

estimate a sparse inverse covariance matrix, and get the graph from the pattern of 0's in the estimated matrix.

68 / 74

Sparse Inverse Covariance

The maximum likelihood estimate of the inverse covariance is given by the solution to

$$\max_{X \succ 0} \log \det X - \mathrm{tr}(SX)$$

where $S$ is the estimated sample covariance matrix

$$S = \sum_{t=1}^{T} y_t y_t'$$

(Yuck)

69 / 74

Sparse Inverse Covariance

We can induce zeros in the solution using a penalized likelihood estimate

$$\max_{X \succ 0} \log \det X - \mathrm{tr}(SX) - \lambda \|X\|_1$$

where

$$\|X\|_1 = \sum_{ij} |X_{ij}|$$

(Yuckier)

70 / 74

Sparse Inverse Covariance

Block-coordinate ascent

Solve by maximizing one column of the matrix at a time (edges for each variable, e.g., $x_{12}$ below):

$$\min_\beta \frac{1}{2} \|X_{11}^{1/2} \beta - z\|^2 + \lambda \|\beta\|_1$$

with $z = W_{11}^{1/2} s_{12}$ and

$$X = \begin{pmatrix} X_{11} & x_{12} \\ x_{12}' & x_{22} \end{pmatrix} \qquad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}' & s_{22} \end{pmatrix}$$

71 / 74

Sparse Inverse Covariance

Block-coordinate ascent

Solve by maximizing one column of the matrix at a time (edges for each variable, e.g., $x_{12}$ below):

$$\min_\beta \frac{1}{2} \|X_{11}^{1/2} \beta - z\|^2 + \lambda \|\beta\|_1$$

The solution is then $x_{12} = W_{11} \beta$.

(This is l1-regularized least squares, easy to solve, not yucky at all.)

72 / 74
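In practice this procedure is implemented by the graphical lasso; a sketch using the glasso package (assumed installed) on simulated data:

library(glasso)
Y <- matrix(rnorm(200 * 10), nrow = 200)  # 200 draws of a 10-dimensional vector
S <- cov(Y)                               # sample covariance
fit <- glasso(S, rho = 0.1)               # rho is the l1 penalty lambda
A <- abs(fit$wi) > 1e-8                   # fit$wi: estimated inverse covariance
diag(A) <- FALSE                          # adjacency from its nonzero pattern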

Sparse Inverse Covariance

Block-coordinate ascent

Solve by maximizing one column of the matrix at a time (edges for each variable, e.g., $x_{12}$ below):

$$\min_\beta \frac{1}{2} \|X_{11}^{1/2} \beta - z\|^2 + \lambda \|\beta\|_1$$

Iterate over the columns of the matrix until $\beta$ converges (or even better, until the objective function converges).

73 / 74

Summary

Using the Gaussian Graphical Model representation:

multivariate normal probability over a sparse graph

take the resulting graph as, e.g., a gene network

Use sparsity-inducing regularization (l1-norm)

The block-coordinate ascent method leads to l1-regularized regression at each step

Can use efficient coordinate descent (soft-thresholding) to solve the regression problem

74 / 74

Statistical Analysis of Network Data

Héctor Corrada Bravo

University of Maryland, College Park, USA
CMSC828O 2019-10-21
