
School of Computer Science

Probabilistic Graphical Models

Gaussian graphical models and Ising models: modeling networks

Eric Xing
Lecture 10, February 20, 2017

Reading: See class website

© Eric Xing @ CMU, 2005-2017 1

Network Research

[Figure: example networks: a social network, the Internet, a gene regulatory network]

© Eric Xing @ CMU, 2005-2017 2

Where do networks come from?

- The Jesus network

© Eric Xing @ CMU, 2005-2017 3

Evolving networks

[Figure: snapshots of an evolving social network in March 2005, January 2006, and August 2006. Cooperativity, antagonism, cliques, ... how do these change over time? Can I get his vote?]

© Eric Xing @ CMU, 2005-2017 4

Evolving networks

[Figure: a sequence of network snapshots at time steps t = 1, 2, 3, ..., T]

© Eric Xing @ CMU, 2005-2017 5

Recall: ML Structural Learning for completely observed GMs

[Figure: an example graphical model over nodes E, R, B, A, C, to be learned from data]

Data: $(x_1^{(1)}, \ldots, x_n^{(1)}),\; (x_1^{(2)}, \ldots, x_n^{(2)}),\; \ldots,\; (x_1^{(M)}, \ldots, x_n^{(M)})$

© Eric Xing @ CMU, 2005-2017 6

Two “Optimal” approaches

- “Optimal” here means the employed algorithms guarantee to return a structure that maximizes the objective (e.g., log-likelihood).
- Many heuristics used to be popular, but they provide no guarantee of attaining optimality or interpretability, or do not even have an explicit objective.
  - E.g.: structured EM, module networks, greedy structural search, etc.
- We will learn two classes of algorithms for guaranteed structure learning, which are likely the only known methods enjoying such guarantees, but they apply only to certain families of graphs:
  - Trees: the Chow-Liu algorithm (this lecture)
  - Pairwise MRFs: covariance selection, neighborhood selection (later)

© Eric Xing @ CMU, 2005-2017 7

Key Idea: network inference as parameter estimation

[Figure: two graphs over nodes x1, x2, x3, x4 and the parameter matrices they correspond to; inferring the network is equivalent to estimating the parameters]

© Eric Xing @ CMU, 2005-2017 8

Model: Pairwise Markov Random Fields

- Nodal states can be either discrete (Ising/Potts model), continuous (Gaussian graphical model), or heterogeneous.
- The parameter matrix encodes the graph structure.

[Figure: a pairwise Markov random field over nodes 1-6]

© Eric Xing @ CMU, 2005-2017 9

Recall Multivariate Gaussian

- Multivariate Gaussian density:

$$p(\mathbf{x} \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu) \right\}$$

- WOLG: let $\mu = 0$ and $Q = \Sigma^{-1}$.
- We can view this as a continuous Markov Random Field with potentials defined on every node and edge:

$$p(x_1, \ldots, x_n \mid \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{n/2}} \exp\left\{ -\tfrac{1}{2}\sum_{i} q_{ii} x_i^2 - \sum_{i<j} q_{ij} x_i x_j \right\}$$

© Eric Xing @ CMU, 2005-2017 10

Gaussian Graphical Model

[Figure: microarray samples for a given cell type, and the estimated Gaussian graphical model, which encodes dependencies among genes]

© Eric Xing @ CMU, 2005-2017 11

Precision Matrix Encodes Non-Zero Edges in Gaussian Graphical Model

An edge corresponds to a non-zero element of the precision matrix.

© Eric Xing @ CMU, 2005-2017 12

Markov versus Correlation Network

- A correlation network is based on the covariance matrix.
- A GGM is a Markov network based on the precision matrix.
  - Conditional independence / partial correlation coefficients are a more sophisticated dependence measure.
- With small sample size, the empirical covariance matrix cannot be inverted.

© Eric Xing @ CMU, 2005-2017 13
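To make the contrast concrete, here is a minimal numpy sketch (my own illustration, not from the slides; the three-node chain and all names are invented for the example). In a chain X1 - X2 - X3, X1 and X3 are marginally correlated but conditionally independent given X2, so a correlation network links them while the precision matrix, read as partial correlations, does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Chain-structured Gaussian data: 1 - 2 - 3
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

S = np.cov(X, rowvar=False)   # empirical covariance (what a correlation network uses)
Q = np.linalg.inv(S)          # precision matrix (what a GGM / Markov network uses)

# Partial correlation between i and j given the rest: -q_ij / sqrt(q_ii * q_jj)
d = np.sqrt(np.diag(Q))
partial = -Q / np.outer(d, d)
np.fill_diagonal(partial, 1.0)

print(np.corrcoef(X, rowvar=False).round(2))  # marginal corr(x1, x3) is clearly non-zero
print(partial.round(2))                       # partial corr(x1, x3 | x2) is ~0: no edge 1-3
```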

Sparsity

- One common assumption to make: sparsity.
- Makes empirical sense: genes are assumed to interact with only small groups of other genes.
- Makes statistical sense: learning is now feasible in high dimensions with small sample size.

© Eric Xing @ CMU, 2005-2017 14

Network Learning with the LASSO

- Assume the network is a Gaussian Graphical Model.

- Perform a LASSO regression of each target node on all the other nodes.

© Eric Xing @ CMU, 2005-2017 15

Network Learning with the LASSO

- LASSO can select the neighborhood of each node.

© Eric Xing @ CMU, 2005-2017 16

L1 Regularization (LASSO)

- A convex relaxation.
- Enforces sparsity!

Constrained form: $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 \;\; \text{subject to} \;\; \|\beta\|_1 \le t$

Lagrangian form: $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$

[Figure: a regression of response y on covariates x1, ..., xn with coefficients β1, ..., βn]

© Eric Xing @ CMU, 2005-2017 17

Theoretical Guarantees

- Assumptions
  - Dependency condition: relevant covariates are not overly dependent.
  - Incoherence condition: the large number of irrelevant covariates cannot be too correlated with the relevant covariates.
  - Strong concentration bounds: sample quantities converge to their expected values quickly.

If these assumptions are met, LASSO will asymptotically recover the correct subset of covariates that are relevant.

© Eric Xing @ CMU, 2005-2017 18

Network Learning with the LASSO

- Repeat this for every node.
- Form the total edge set (e.g., by taking the union or intersection of the estimated neighborhoods).

© Eric Xing @ CMU, 2005-2017 19

Consistent Structure Recovery
[Meinshausen and Bühlmann 2006, Wainwright 2009]

If [the sample size and regularization satisfy the stated scaling conditions], then with high probability, [the estimated neighborhoods recover the true graph].

© Eric Xing @ CMU, 2005-2017 20
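As a point of reference, a hedged paraphrase of the scaling usually quoted from these papers (stated from memory, so treat the constants as indicative rather than exact): with the regularization chosen as $\lambda_n \asymp \sqrt{\log p / n}$ and a sample size of order

$$n \;\gtrsim\; d \log p \qquad (p = \text{number of nodes},\; d = \text{maximum neighborhood size}),$$

the dependency and incoherence conditions imply that every estimated neighborhood equals the true one with probability tending to 1, so combining the neighborhoods recovers the graph consistently.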

Why does this algorithm work?

- What is the intuition behind graphical regression?
  - Continuous nodal attributes
  - Discrete nodal attributes
- Are there other algorithms?
- More general scenarios: non-iid samples and evolving networks
- Case study

© Eric Xing @ CMU, 2005-2017 21

Multivariate Gaussian

- Multivariate Gaussian density:

$$p(\mathbf{x} \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu) \right\}$$

- A joint Gaussian:

$$p(\mathbf{x}_1, \mathbf{x}_2 \mid \mu, \Sigma) = \mathcal{N}\!\left( \begin{bmatrix}\mathbf{x}_1 \\ \mathbf{x}_2\end{bmatrix} \,\middle|\, \begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}, \begin{bmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{bmatrix} \right)$$

- How to write down p(x2), p(x1|x2) or p(x2|x1) using the block elements in μ and Σ?
- Formulas to remember:

$$p(\mathbf{x}_2) = \mathcal{N}(\mathbf{x}_2 \mid m_2, V_2), \qquad m_2 = \mu_2, \quad V_2 = \Sigma_{22}$$

$$p(\mathbf{x}_1 \mid \mathbf{x}_2) = \mathcal{N}(\mathbf{x}_1 \mid m_{1|2}, V_{1|2}), \qquad m_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \mu_2), \quad V_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$

© Eric Xing @ CMU, 2005-2017 22
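A quick numpy sanity check of these block formulas (a self-contained sketch with a randomly generated Σ of my own choosing): it confirms that the conditional covariance $V_{1|2}$ computed from the formula above equals the inverse of the corresponding block of the precision matrix $Q = \Sigma^{-1}$, which is exactly the link the next slides exploit.

```python
import numpy as np

rng = np.random.default_rng(1)

# A random valid covariance over 5 variables; x1 = first 2 coords, x2 = last 3.
A = rng.normal(size=(5, 5))
Sigma = A @ A.T + 5 * np.eye(5)
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

# Conditional covariance from the block formula on the slide.
V_1_given_2 = S11 - S12 @ np.linalg.inv(S22) @ S21

# The same quantity from the precision matrix: inverse of the (1,1) block of Q.
Q = np.linalg.inv(Sigma)
V_from_precision = np.linalg.inv(Q[:2, :2])

print(np.allclose(V_1_given_2, V_from_precision))  # True
```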

The matrix inverse lemma

- Consider a block-partitioned matrix:

$$M = \begin{bmatrix} E & F \\ G & H \end{bmatrix}$$

- First we diagonalize M:

$$\begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} E & F \\ G & H \end{bmatrix} \begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix} = \begin{bmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{bmatrix}$$

- Schur complement: $M/H = E - FH^{-1}G$

- Then we invert, using this formula: $XYZ = W \;\Rightarrow\; Y^{-1} = Z\,W^{-1}X$

$$M^{-1} = \begin{bmatrix} E & F \\ G & H \end{bmatrix}^{-1} = \begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix} \begin{bmatrix} (M/H)^{-1} & 0 \\ 0 & H^{-1} \end{bmatrix} \begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix}$$

$$= \begin{bmatrix} (M/H)^{-1} & -(M/H)^{-1}FH^{-1} \\ -H^{-1}G(M/H)^{-1} & H^{-1} + H^{-1}G(M/H)^{-1}FH^{-1} \end{bmatrix} = \begin{bmatrix} E^{-1} + E^{-1}F(M/E)^{-1}GE^{-1} & -E^{-1}F(M/E)^{-1} \\ -(M/E)^{-1}GE^{-1} & (M/E)^{-1} \end{bmatrix}$$

- Matrix inverse lemma:

$$\left(E - FH^{-1}G\right)^{-1} = E^{-1} + E^{-1}F\left(H - GE^{-1}F\right)^{-1}GE^{-1}$$

© Eric Xing @ CMU, 2005-2017 23

The covariance and the precision matrices

Applying the block inverse from the previous slide to the covariance matrix, partitioned as the first variable versus the rest:

$$\Sigma = \begin{bmatrix} \sigma_{11} & \vec{\sigma}_1^T \\ \vec{\sigma}_1 & \Sigma_{-1} \end{bmatrix} \quad\Longrightarrow\quad Q = \Sigma^{-1} = \begin{bmatrix} q_{11} & \vec{q}_1^T \\ \vec{q}_1 & Q_{-1} \end{bmatrix}$$

with

$$q_{11} = \left(\sigma_{11} - \vec{\sigma}_1^T \Sigma_{-1}^{-1}\vec{\sigma}_1\right)^{-1}, \qquad \vec{q}_1 = -q_{11}\,\Sigma_{-1}^{-1}\vec{\sigma}_1, \qquad Q_{-1} = \Sigma_{-1}^{-1} + q_{11}\,\Sigma_{-1}^{-1}\vec{\sigma}_1\vec{\sigma}_1^T\Sigma_{-1}^{-1}.$$

© Eric Xing @ CMU, 2005-2017 24

Single-node Conditional

- The conditional distribution of a single node i given the rest of the nodes can be written using the Gaussian conditioning formulas:

$$p(x_i \mid \mathbf{x}_{-i}) = \mathcal{N}\!\left(x_i \,\middle|\, \mu_i + \vec{\sigma}_i^T \Sigma_{-i}^{-1}(\mathbf{x}_{-i} - \mu_{-i}),\; \sigma_{ii} - \vec{\sigma}_i^T \Sigma_{-i}^{-1}\vec{\sigma}_i\right)$$

- WOLG: let $\mu = 0$. Using the precision-matrix blocks from the previous slide, this becomes

$$p(x_i \mid \mathbf{x}_{-i}) = \mathcal{N}\!\left(x_i \,\middle|\, -\tfrac{1}{q_{ii}}\vec{q}_i^T\,\mathbf{x}_{-i},\; \tfrac{1}{q_{ii}}\right)$$

© Eric Xing @ CMU, 2005-2017 25

Conditional auto-regression

- From the single-node conditional above, we can write a conditional auto-regression function for each node (see the sketch below).

- Neighborhood estimation is based on the auto-regression coefficients.

© Eric Xing @ CMU, 2005-2017 26
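A hedged reconstruction of the auto-regression implied by the single-node conditional above (the coefficient notation $\theta_{ij}$ is mine):

$$x_i \;=\; \sum_{j \ne i} \theta_{ij}\, x_j + \epsilon_i, \qquad \theta_{ij} = -\frac{q_{ij}}{q_{ii}}, \qquad \epsilon_i \sim \mathcal{N}\!\left(0, \tfrac{1}{q_{ii}}\right),$$

so $\theta_{ij} \ne 0 \iff q_{ij} \ne 0$: the non-zero regression coefficients of node $i$ identify exactly its neighbors in the GGM.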

Conditional independence

- From the conditional auto-regression above, given an estimate of the neighborhood $s_i$, we have $p(x_i \mid \mathbf{x}_{-i}) = p(x_i \mid \mathbf{x}_{s_i})$.

- Thus the neighborhood $s_i$ defines the Markov blanket of node i.

© Eric Xing @ CMU, 2005-2017 27

Recent trends in GGM:

- Covariance selection (classical method)
  - Dempster [1972]: sequentially pruning the smallest elements in the precision matrix
  - Drton and Perlman [2008]: improved statistical tests for pruning
  - Serious limitation in practice: breaks down when the covariance matrix is not invertible

- L1-regularization based methods (hot!)
  - Meinshausen and Bühlmann [Ann. Stat. 06]: used LASSO regression for neighborhood selection
  - Banerjee [JMLR 08]: block sub-gradient algorithm for finding the precision matrix
  - Friedman et al. [Biostatistics 08]: efficient fixed-point equations based on a sub-gradient algorithm
  - ...
  - Structure learning is possible even when # variables > # samples

© Eric Xing @ CMU, 2005-2017 28

The Meinshausen-Bühlmann (MB) algorithm:

Step 1: Pick one variable (the k-th node).

Step 2: Think of it as “y”, and the rest as “z”.

Step 3: Solve the Lasso regression problem of y on z.

Step 4: Connect the k-th node to those nodes having nonzero weight in w.

- This solves a separate Lasso for every single variable (a sketch follows below).
- The resulting coefficients do not correspond to the Q values element-wise.

© Eric Xing @ CMU, 2005-2017 29
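Here is a minimal sketch of the per-node regression loop described above, using scikit-learn (the function name, the regularization value, and the OR combination rule are illustrative choices, not the authors' reference implementation):

```python
import numpy as np
from sklearn.linear_model import Lasso

def mb_neighborhood_selection(X, alpha=0.1):
    """Estimate the edge set of a GGM by per-node Lasso regression (MB-style sketch)."""
    n, p = X.shape
    W = np.zeros((p, p))                      # W[k, j]: weight of node j when regressing node k
    for k in range(p):
        y = X[:, k]                           # Steps 1-2: node k is "y", the rest is "z"
        z = np.delete(X, k, axis=1)
        fit = Lasso(alpha=alpha).fit(z, y)    # Step 3: Lasso regression of y on z
        W[k, np.arange(p) != k] = fit.coef_   # Step 4: nonzero weights define k's neighborhood
    # The neighborhoods need not be symmetric; combine them, e.g. with the OR rule.
    return (W != 0) | (W != 0).T

# Usage (hypothetical data file, standardized columns):
# X = np.loadtxt("expression_matrix.txt")
# edges = mb_neighborhood_selection(X, alpha=0.1)
```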

L1-regularized maximum likelihood learning

- Input: sample covariance matrix S
  - Assumes standardized data (mean = 0, variance = 1)
  - S is generally rank-deficient, so its inverse does not exist

- Output: sparse precision matrix Q
  - Ideally Q is defined as the inverse of S, but S is not directly invertible
  - Need to find a sparse matrix that can be thought of as an inverse of S

- Approach: solve an L1-regularized maximum likelihood problem:

$$\hat{Q} = \arg\max_{Q \succ 0}\; \underbrace{\log\det Q - \mathrm{tr}(SQ)}_{\text{log likelihood}} \;-\; \underbrace{\lambda\|Q\|_1}_{\text{regularizer}}$$

© Eric Xing @ CMU, 2005-2017 30
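As a concrete (but not slide-endorsed) counterpart, scikit-learn ships an implementation of this L1-penalized maximum likelihood objective; a minimal usage sketch with an illustrative penalty value:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# X: (n samples) x (p variables), assumed standardized (mean 0, variance 1).
X = np.random.default_rng(0).normal(size=(50, 10))   # placeholder data

model = GraphicalLasso(alpha=0.2)    # alpha plays the role of the regularization weight
model.fit(X)

Q = model.precision_                 # sparse estimate of the precision matrix
edges = np.abs(Q) > 1e-8             # non-zero off-diagonal entries are the estimated edges
np.fill_diagonal(edges, False)
print(np.argwhere(edges))
```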

From matrix opt. to vector opt.: coupled Lasso for every single variable

- Focus on only one row (column), keeping the others constant.
- The optimization problem for the blue vector is shown to be a Lasso (L1-regularized quadratic program).
- Difference from MB's approach: the resulting Lasso problems are coupled.
  - The gray part is actually not constant; it changes after solving one Lasso problem (because it is the entire Q that optimizes a single loss function, whereas in MB each Lasso has its own loss function).
- This coupling is essential for stability under noise.

© Eric Xing @ CMU, 2005-2017 31

Learning Ising Model (i.e. pairwise MRF)

- Assuming the nodes are discrete (e.g., the voting outcome of a person) and the edges are weighted, then for a sample x we have the Ising joint distribution.

- It can be shown that the pseudo-conditional likelihood for node k is a logistic function of the neighboring nodes, so each neighborhood can again be selected with an L1-regularized (logistic) regression (see the sketch below).

© Eric Xing @ CMU, 2005-2017 32
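To make this concrete, here is a hedged sketch of neighborhood selection for an Ising model with states in {-1, +1} (my own illustration, not the slide's code): since $p(x_k \mid \mathbf{x}_{-k})$ is a logistic function of $\theta_{kk} + \sum_{l \ne k} \theta_{kl} x_l$, each node's neighborhood can be read off an L1-regularized logistic regression, in direct analogy with the Gaussian case.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ising_neighborhoods(X, C=0.5):
    """Per-node L1-regularized logistic regression (Ising structure-learning sketch).

    X: (n samples) x (p nodes) array with entries in {-1, +1}.
    C: inverse regularization strength (smaller C -> sparser neighborhoods).
    """
    n, p = X.shape
    Theta = np.zeros((p, p))
    for k in range(p):
        y = X[:, k]                              # node k plays the role of the label
        Z = np.delete(X, k, axis=1)              # all other nodes are the features
        clf = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(Z, y)
        Theta[k, np.arange(p) != k] = clf.coef_[0]
    return (Theta != 0) | (Theta != 0).T         # OR-combined edge estimates

# Usage (hypothetical file with +/-1 coded Senate votes):
# votes = np.loadtxt("senate_votes.txt")
# edges = ising_neighborhoods(votes, C=0.2)
```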

Question: vector-valued nodes

© Eric Xing @ CMU, 2005-2017 33

New Problem: Evolving Social Networks

[Figure: snapshots of the Senate network in March 2005, January 2006, and August 2006. Cooperativity, antagonism, cliques, ... how do these change over time? Can I get his vote?]

© Eric Xing @ CMU, 2005-2017 34

Reverse engineering time-specific "rewiring" networks

[Figure: Drosophila development over a time course from T0 to TN; at each target time t* there is n = 1 or some small number of samples]

© Eric Xing @ CMU, 2005-2017 35

Inference I [Song, Kolar and Xing, Bioinformatics 09]

- KELLER: Kernel Weighted L1-regularized Logistic Regression
  - Constrained convex optimization
  - Estimate time-specific networks one by one, based on "virtual iid" samples
  - Could scale to ~10^4 genes, but under stronger smoothness assumptions

Lasso: (a hedged form of the kernel-weighted estimator is sketched below)

© Eric Xing @ CMU, 2005-2017 36
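A hedged reconstruction of the kernel-weighted estimator behind KELLER (notation mine; see Song, Kolar and Xing 2009 for the exact form): each time-specific neighborhood is fit on all samples, reweighted by a kernel centered at the target time $t^*$,

$$\hat{\theta}_i^{\,t^*} = \arg\min_{\theta_i} \; \sum_{t=1}^{T} w(t, t^*)\, \ell\big(x_i^{t}, \mathbf{x}_{-i}^{t}; \theta_i\big) \;+\; \lambda \|\theta_i\|_1, \qquad w(t, t^*) = \frac{K_h(t - t^*)}{\sum_{t'} K_h(t' - t^*)},$$

where $\ell$ is the logistic loss (or the squared loss for Gaussian nodes) and $K_h$ is a smoothing kernel with bandwidth $h$.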

Algorithm – nonparametric neighborhood selection

- Conditional likelihood
- Neighborhood selection:
- Time-specific graph regression:
  - Estimate at ..., where ... and ...

© Eric Xing @ CMU, 2005-2017 37

Structural consistency of KELLER

Assumptions

- Define:
- A1: Dependency Condition
- A2: Incoherence Condition
- A3: Smoothness Condition
- A4: Bounded Kernel

© Eric Xing @ CMU, 2005-2017 38

Theorem [Kolar and Xing, 09]

© Eric Xing @ CMU, 2005-2017 39

Inference II [Amr and Xing, PNAS 2009, AOAS 2009]

- TESLA: Temporally Smoothed L1-regularized logistic regression
  - Constrained convex optimization
  - Scales to ~5000 nodes, does not need a smoothness assumption, and can accommodate abrupt changes.

© Eric Xing @ CMU, 2005-2017 40

Temporally Smoothed Graph Regression

TESLA:

© Eric Xing @ CMU, 2005-2017 41
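A hedged sketch of the TESLA objective (notation mine; the exact formulation is in the cited papers): all time points are estimated jointly, with one L1 penalty enforcing sparsity within each time-specific network and a second total-variation penalty on differences between adjacent time points, which is what permits abrupt but infrequent changes:

$$\hat{\theta}_i^{\,1:T} = \arg\min_{\theta_i^{1}, \ldots, \theta_i^{T}} \; \sum_{t=1}^{T} \ell\big(x_i^{t}, \mathbf{x}_{-i}^{t}; \theta_i^{t}\big) \;+\; \lambda_1 \sum_{t=1}^{T} \big\|\theta_i^{t}\big\|_1 \;+\; \lambda_2 \sum_{t=2}^{T} \big\|\theta_i^{t} - \theta_i^{t-1}\big\|_1.$$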

Modified estimation procedure

- Estimate the block partition on which the coefficient functions are constant:  (*)

- Estimate the coefficient functions on each block of the partition:  (**)

© Eric Xing @ CMU, 2005-2017 42

Structural Consistency of TESLA

I. It can be shown that, by applying the model-selection results for the Lasso to a temporal-difference transformation of (*), the blocks are estimated consistently.

II. It can then be further shown that, by applying the Lasso to (**), the neighborhood of each node on each of the estimated blocks is recovered consistently.

- Further advantages of the two-step procedure:
  - choosing parameters is easier
  - faster optimization procedure

[Kolar, and Xing, 2009]

© Eric Xing @ CMU, 2005-2017 43

Senate network – 109th congress

- Voting records from the 109th congress (2005 - 2006)
- There are 100 senators whose votes were recorded on 542 bills; each vote is a binary outcome.

© Eric Xing @ CMU, 2005-2017 44

Senate network – 109th congress

[Figure: estimated Senate networks in March 2005, January 2006, and August 2006]

© Eric Xing @ CMU, 2005-2017 45

Senator Chafee

© Eric Xing @ CMU, 2005-2017 46

Senator Ben Nelson

[Figure: Senator Ben Nelson's estimated neighborhood at T = 0.2 and T = 0.8]

© Eric Xing @ CMU, 2005-2017 47

Progression and Reversion of Breast Cancer cells

[Figure: cell types S1 (normal), T4 (malignant), and the reverted lines T4 revert 1, T4 revert 2, T4 revert 3]

© Eric Xing @ CMU, 2005-2017 48

Estimate Neighborhoods Jointly Across All Cell Types

[Figure: a tree relating the cell types S1, T4, T4R1, T4R2, T4R3]

How to share information across the cell types?

© Eric Xing @ CMU, 2005-2017 49

Sparsity of Difference

Penalize differences between the networks of adjacent cell types.

[Figure: the cell-type tree (S1, T4, T4R1, T4R2, T4R3) with penalties on the differences along its edges]

© Eric Xing @ CMU, 2005-2017 50

Tree-Guided Graphical Lasso (Treegl)

Objective: RSS for all cell types + sparsity + sparsity of difference

© Eric Xing @ CMU, 2005-2017 51
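A hedged sketch of what the three labeled terms amount to (notation mine, not the paper's): per-node regressions over all cell types, an L1 penalty on each cell type's coefficients, and an L1 fusion penalty along the edges of the cell-type tree $\mathcal{T}$:

$$\min_{\{\beta_i^{c}\}} \; \underbrace{\sum_{c}\sum_{i} \big\|\mathbf{x}_i^{c} - \mathbf{X}_{-i}^{c}\,\beta_i^{c}\big\|_2^2}_{\text{RSS for all cell types}} \;+\; \lambda_1 \underbrace{\sum_{c}\sum_{i} \big\|\beta_i^{c}\big\|_1}_{\text{sparsity}} \;+\; \lambda_2 \underbrace{\sum_{(c,c') \in \mathcal{T}}\sum_{i} \big\|\beta_i^{c} - \beta_i^{c'}\big\|_1}_{\text{sparsity of difference}}.$$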

Network Overview

[Figure: estimated networks for S1 cells, T4 cells, and the EGFR-ITGB1, PI3K-MAPKK, and MMP reverted lines]

© Eric Xing @ CMU, 2005-2017 52

Interactions – Biological Processes

[Figure: S1 cells versus T4 cells: increased cell proliferation, growth, signaling, locomotion]

© Eric Xing @ CMU, 2005-2017 53

Interactions – Biological Processes

[Figure: T4 cells versus MMP-T4R cells: significantly reduced interactions]

© Eric Xing @ CMU, 2005-2017 54

Interactions – Biological Processes

[Figure: T4 cells versus PI3K-MAPKK-T4R cells: reduced growth, locomotion and signaling]

© Eric Xing @ CMU, 2005-2017 55

Fancier network estimation scenarios

- Dynamic Directed (auto-regressive) Networks [Song, Kolar and Xing, NIPS 2009]
- Missing Data [Kolar and Xing, ICML 2012]
- Multi-attribute Data [Kolar, Liu and Xing, JMLR 2013]

© Eric Xing @ CMU, 2005-2017 56

Summary

- Graphical Gaussian Model
  - The precision matrix encodes the structure
  - Not estimable by matrix inversion when p >> n
- Neighborhood selection:
  - Conditional distribution under a GGM/MRF
  - Graphical lasso
  - Sparsistency
- Time-varying Markov networks
  - Kernel reweighting estimator
  - Total variation estimator

© Eric Xing @ CMU, 2005-2017 57