Variational Methods for Graphical Models

Transcript of "Variational Methods for Graphical Models"
Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, Lawrence K. Saul
Presented by: Afsaneh Shirazi
Outline
• Motivation
• Inference in graphical models
• Exact inference is intractable
• Variational methodology
  – Sequential approach
  – Block approach
• Conclusions
Motivation (Example: Medical Diagnosis)
[Figure: bipartite network with diseases on top and symptoms below.]
What is the most probable disease?
Motivation
• We want to answer queries about our data
• A graphical model is a way to model data
• Inference in some graphical models is intractable (NP-hard)
• Variational methods simplify inference in graphical models by using approximations
Graphical Models
• Directed (Bayesian network)
• Undirected
[Figure: a directed graph over S1, ..., S5 with factors P(S1), P(S2), P(S3|S1,S2), P(S4|S3), P(S5|S3,S4), and an undirected graph with cliques C1, C2, C3.]
Inference in Graphical Models
Inference: given a graphical model, the process of computing answers to queries.
• How computationally hard is this decision problem?
• Theorem: computing P(X = x) in a Bayesian network is NP-hard.
Why Is Exact Inference Intractable?
[Figure: bipartite QMR-style network, diseases on top and symptoms below.]
Diagnose the most probable disease.
Why Is Exact Inference Intractable?
[Figure: diseases d and symptoms f; shaded nodes are observed symptoms.]

P(f, d) = P(f | d) P(d)
Why Is Exact Inference Intractable?
Noisy-OR model for P(f_i | d).
[Figure: finding f_i with disease parents taking values d = (1, 0, 1).]
Why Is Exact Inference Intractable?
Noisy-OR model: P(f_i = 0 | d = (1, 0, 1)).
[Figure: the same network with the parent configuration d = (1, 0, 1).]
Why Is Exact Inference Intractable?

P(f_i = 0 | d) = (1 - q_{i0}) \prod_{j \in pa(i)} (1 - q_{ij})^{d_j} = e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}

P(f_i = 1 | d) = 1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}

where \theta_{ij} = -\ln(1 - q_{ij}).
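A minimal sketch of the noisy-OR computation above. The helper names and the parameter values are illustrative, not from the slides; the test simply checks that the exponential form agrees with the product form (1 - q_{i0}) \prod_j (1 - q_{ij})^{d_j}.

```python
import math

def noisy_or_p_f0(d, theta0, theta):
    """P(f_i = 0 | d) in exponential form: exp(-theta0 - sum_j theta_j * d_j)."""
    return math.exp(-theta0 - sum(t * dj for t, dj in zip(theta, d)))

def noisy_or_p_f1(d, theta0, theta):
    """P(f_i = 1 | d) = 1 - P(f_i = 0 | d)."""
    return 1.0 - noisy_or_p_f0(d, theta0, theta)

# Three disease parents with configuration d = (1, 0, 1); theta_j = -ln(1 - q_j)
# for inhibition parameters q_j (illustrative values), plus a leak term q_0.
q = [0.4, 0.7, 0.2]
theta = [-math.log(1 - qj) for qj in q]
theta0 = -math.log(1 - 0.05)          # leak term
p0 = noisy_or_p_f0((1, 0, 1), theta0, theta)
# Equivalent product form: (1 - q_0) * prod_j (1 - q_j)^{d_j}
p0_prod = (1 - 0.05) * (1 - 0.4) * (1 - 0.2)
```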
Why Is Exact Inference Intractable?
Observed symptoms (findings) f:

P(f, d) = P(f | d) P(d) = [\prod_i P(f_i | d)] [\prod_j P(d_j)]

For negative findings the exponentials factorize over the diseases:

e^{-\theta_{i0} - \sum_j \theta_{ij} d_j} = e^{-\theta_{i0}} \prod_j (e^{-\theta_{ij}})^{d_j}
Why Is Exact Inference Intractable?
Each positive finding contributes a factor (1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}), and the product of such factors,

(1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}) \cdots (1 - e^{-\theta_{k0} - \sum_j \theta_{kj} d_j}),

does not factorize: expanding it couples the diseases, so the cost of exact inference grows exponentially in the number of positive findings.
Reducing the Computational Complexity: Variational Methods
• Approximate the probability distribution
• Use the role of convexity
• Obtain a simple graph to which exact methods apply
Express a Function Variationally
• \ln(x) is a concave function:

\ln(x) = \min_{\lambda} \{\lambda x - H(\lambda)\}, where H(\lambda) = \min_{x} \{\lambda x - \ln(x)\}
Express a Function Variationally
• For \ln(x) the conjugate works out to H(\lambda) = \ln(\lambda) + 1, giving

\ln(x) = \min_{\lambda} \{\lambda x - \ln(\lambda) - 1\}
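A numeric sanity check of the variational representation above (a sketch, not from the slides): for any fixed \lambda the expression \lambda x - \ln(\lambda) - 1 upper-bounds \ln(x), and minimizing over \lambda recovers \ln(x) at \lambda = 1/x.

```python
import math

def ln_variational(x, lambdas):
    """Minimize (lambda * x - ln(lambda) - 1) over a grid of lambda values."""
    return min(lam * x - math.log(lam) - 1 for lam in lambdas)

x = 2.5
grid = [i / 10000 for i in range(1, 100001)]   # lambda in (0, 10]
approx = ln_variational(x, grid)               # grid contains the optimum 1/x = 0.4
exact = math.log(x)
# Any fixed lambda gives an upper bound on ln(x):
bound_at = 0.7 * x - math.log(0.7) - 1
```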
Express a Function Variationally
• If the function is neither convex nor concave, transform it to a desired form.
• Example: the logistic function

f(x) = 1 / (1 + e^{-x})

Transformation: g(x) = \ln(f(x)) is concave, so g(x) = \min_{\lambda} \{\lambda x - H(\lambda)\}.
Transforming back (approximation):

f(x) = \min_{\lambda} e^{\lambda x - H(\lambda)}

where H(\lambda) = -\lambda \ln \lambda - (1 - \lambda) \ln(1 - \lambda) is the binary entropy.
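As a numeric check of the bound above (a sketch, not from the slides): e^{\lambda x - H(\lambda)} dominates the logistic function for every \lambda in (0, 1), with equality at \lambda = \sigma(-x) = 1 - \sigma(x).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_entropy(lam):
    return -lam * math.log(lam) - (1 - lam) * math.log(1 - lam)

def logistic_upper_bound(x, lam):
    """Variational upper bound e^{lam*x - H(lam)} on the logistic function."""
    return math.exp(lam * x - binary_entropy(lam))

x = -1.3
for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
    assert logistic_upper_bound(x, lam) >= sigmoid(x)
# The bound is tight at lam = sigmoid(-x):
tight = logistic_upper_bound(x, sigmoid(-x))
```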
Approaches to Variational Methods
• Sequential approach (on-line): nodes are transformed one at a time, in an order determined during the inference process.
• Block approach (off-line): used when the model has obvious substructures; the approximating family is designed in advance.
Sequential Approach (Two Methods)
1. Start from the untransformed graph and transform one node at a time, until the graph is simple enough for exact methods.
2. Start from the completely transformed graph and reintroduce one node at a time, stopping while the graph is still simple enough for exact methods.
Sequential Approach (Example)

P(f_i = 1 | d) = 1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}

This function is log concave.
Sequential Approach (Example)
Since \ln f is concave for f(x) = 1 - e^{-x}, the conjugate bound gives f(x) \le e^{\lambda x - f^*(\lambda)}, so

P(f_i = 1 | d) \le e^{\lambda_i(\theta_{i0} + \sum_j \theta_{ij} d_j) - f_i^*(\lambda_i)} = e^{\lambda_i \theta_{i0} - f_i^*(\lambda_i)} \prod_j [e^{\lambda_i \theta_{ij}}]^{d_j}

The bound factorizes over the diseases d_j.
Sequential Approach (Example)
Transforming a finding node f_i with the bound

P(f_i = 1 | d) \le e^{\lambda_i \theta_{i0} - f_i^*(\lambda_i)} \prod_j [e^{\lambda_i \theta_{ij}}]^{d_j}

delinks f_i from its disease parents: each factor e^{\lambda_i \theta_{ij}} is absorbed into the corresponding disease node (e.g. it multiplies P(d_3 = 1), while P(d_3 = 0) is unchanged).
Sequential Approach (Example)
[Figure: the same transformation applied to a second finding node, using the same factorized upper bound.]
Sequential Approach (Example)
[Figure: after enough finding nodes are transformed, the remaining graph is simple enough for exact inference.]
Sequential Approach (Upper Bound and Lower Bound)
• We need both a lower bound and an upper bound:

P(d_j | f) = P(f, d_j) / (P(f, d_j) + P(f, \bar{d}_j))

P(d_j | f) \le UB(P(f, d_j)) / (UB(P(f, d_j)) + LB(P(f, \bar{d}_j)))
How to Compute a Lower Bound for a Concave Function?
• Lower bound for concave functions (Jensen's inequality):

f(\sum_j a_j z_j) = f(\sum_j q_j (a_j z_j / q_j)) \ge \sum_j q_j f(a_j z_j / q_j)

The variational parameter q is a probability distribution (q_j \ge 0, \sum_j q_j = 1).
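The inequality above can be checked numerically (a sketch with made-up values, not from the slides), here with the concave function f = ln; the bound is tight when q_j is proportional to a_j z_j.

```python
import math

def jensen_lower_bound(f, terms, q):
    """sum_j q_j * f(a_j z_j / q_j), a lower bound on f(sum_j a_j z_j) for concave f."""
    return sum(qj * f(t / qj) for t, qj in zip(terms, q))

terms = [0.5, 1.2, 0.3]        # values of a_j * z_j
q = [0.2, 0.5, 0.3]            # variational distribution, sums to 1
exact = math.log(sum(terms))
bound = jensen_lower_bound(math.log, terms, q)

# The bound is tight when q_j is proportional to the terms:
q_star = [t / sum(terms) for t in terms]
tight = jensen_lower_bound(math.log, terms, q_star)
```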
Block Approach (Overview)
• Off-line application of the sequential approach:
  – Identify a substructure amenable to exact inference
  – Define a family of probability distributions via the introduction of variational parameters
  – Choose the best approximation based on the evidence
Block Approach (Details)
• KL divergence:

D(Q || P) = \sum_{\{S\}} Q(S) \ln (Q(S) / P(S))

• From the family Q(H | E, \mu), choose Q(H | E, \mu^*) that minimizes the KL divergence to P(H | E).
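A minimal sketch of the KL divergence over a discrete state space (function name and distributions are illustrative): non-negative, zero only when Q = P, and asymmetric in its arguments.

```python
import math

def kl_divergence(q, p):
    """D(Q || P) = sum_s Q(s) * ln(Q(s) / P(s)) over a discrete state space."""
    return sum(qs * math.log(qs / ps) for qs, ps in zip(q, p) if qs > 0)

q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
d_qp = kl_divergence(q, p)   # > 0 since Q != P
d_pp = kl_divergence(p, p)   # = 0
```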
Block Approach (Example: Boltzmann Machine)

P(S | \theta) = e^{\sum_{i<j} \theta_{ij} S_i S_j + \sum_i \theta_{i0} S_i} / Z

[Figure: nodes S_i and S_j connected by weight \theta_{ij}.]
Block Approach (Example: Boltzmann Machine)
Conditioning on evidence nodes (e.g. S_j = 1, \theta_{ij} \cdot 1) folds their contribution into the biases of the remaining nodes:

\theta_{i0}^c = \theta_{i0} + \sum_{j \in E} \theta_{ij} S_j
Block Approach (Example: Boltzmann Machine)

P(H | E, \theta) = e^{\sum_{i<j} \theta_{ij} S_i S_j + \sum_i \theta_{i0}^c S_i} / Z_c

Mean-field approximating family (fully factorized):

Q(H | E, \mu) = \prod_{i \in H} \mu_i^{S_i} (1 - \mu_i)^{1 - S_i}
Block Approach (Example: Boltzmann Machine)
Minimizing the KL divergence yields

\mu_i = \sigma(\sum_j \theta_{ij} \mu_j + \theta_{i0}), where \sigma(x) = 1 / (1 + e^{-x})
Block Approach (Example: Boltzmann Machine)
Mean field equations: solve for a fixed point of

\mu_i = \sigma(\sum_j \theta_{ij} \mu_j + \theta_{i0})

by iterating the updates until convergence.
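The fixed-point iteration above can be sketched for a tiny Boltzmann machine (the weights and biases below are illustrative, and coordinate-wise iteration to convergence is one simple choice of update schedule):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_field(theta, theta0, iters=200):
    """Iterate mu_i = sigmoid(sum_j theta[i][j] * mu_j + theta0[i]) to a fixed point."""
    n = len(theta0)
    mu = [0.5] * n
    for _ in range(iters):
        for i in range(n):
            mu[i] = sigmoid(sum(theta[i][j] * mu[j] for j in range(n)) + theta0[i])
    return mu

# Three hidden units; symmetric weights with zero self-connections.
theta = [[0.0, 0.8, -0.5],
         [0.8, 0.0, 0.3],
         [-0.5, 0.3, 0.0]]
theta0 = [0.1, -0.2, 0.4]
mu = mean_field(theta, theta0)
```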
Conclusions
• The time or space complexity of exact calculation can be unacceptable
• Complex graphs can be probabilistically simple
• Inference in simplified models provides bounds on probabilities in the original model
Extra Slides
Concerns
• How accurate is the approximation?
• Can strong dependencies be identified in advance?
• The block approach is not based on a convexity transformation
• There is no assurance that the framework will transfer to other examples
• It is not straightforward to develop a variational approximation for new architectures
Justification for KL Divergence
• Q gives the best lower bound on the probability of the evidence P(E):

\ln P(E) = \ln \sum_{\{H\}} P(H, E)
         = \ln \sum_{\{H\}} Q(H | E) \frac{P(H, E)}{Q(H | E)}
         \ge \sum_{\{H\}} Q(H | E) \ln \frac{P(H, E)}{Q(H | E)}

Maximizing this bound over Q is equivalent to minimizing D(Q(H | E) || P(H | E)).
EM
• Maximum likelihood parameter estimation: maximize P(E | \theta).
• The following function is a lower bound on the log likelihood:

L(Q, \theta) = \sum_{\{H\}} Q(H | E) \ln P(H, E | \theta) - \sum_{\{H\}} Q(H | E) \ln Q(H | E)

\ln P(E | \theta) = L(Q, \theta) + D(Q(H | E) || P(H | E, \theta))
EM
1. (E step) Maximize the bound with respect to Q:  Q^{(k+1)} = \arg\max_Q L(Q, \theta^{(k)})
2. (M step) Fix Q, maximize with respect to \theta:  \theta^{(k+1)} = \arg\max_\theta L(Q^{(k+1)}, \theta)

In traditional EM the E step sets Q = P(H | E, \theta^{(k)}); restricting Q to a tractable family gives an approximation to the EM algorithm.
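The E/M alternation above can be sketched on the classic two-coins problem (the session counts are made-up data, and fixing the mixing weights at 1/2 is a simplification for brevity): the hidden variable is which coin produced each session, and the E step computes its posterior under the current biases.

```python
def em_two_coins(counts, theta=(0.6, 0.5), iters=50):
    """EM for two coins with a hidden per-session coin choice.
    counts: list of (heads, tails) per session; mixing weights fixed at 1/2."""
    a, b = theta
    for _ in range(iters):
        # E step: responsibility of coin A for each session, and expected counts.
        exp_a = [0.0, 0.0]
        exp_b = [0.0, 0.0]
        for h, t in counts:
            la = (a ** h) * ((1 - a) ** t)
            lb = (b ** h) * ((1 - b) ** t)
            ra = la / (la + lb)
            exp_a[0] += ra * h
            exp_a[1] += ra * t
            exp_b[0] += (1 - ra) * h
            exp_b[1] += (1 - ra) * t
        # M step: maximize the expected complete-data log likelihood.
        a = exp_a[0] / (exp_a[0] + exp_a[1])
        b = exp_b[0] / (exp_b[0] + exp_b[1])
    return a, b

counts = [(9, 1), (8, 2), (4, 6), (9, 1), (3, 7)]
a, b = em_two_coins(counts)   # coin A absorbs the heads-heavy sessions
```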
Principle of Inference
DAG → Junction Tree → (Initialization) → Inconsistent Junction Tree → (Propagation) → Consistent Junction Tree → (Marginalization) → P(V = v | E = e)
Example: Create Join Tree
HMM with 2 time steps: X1 → X2, with emissions X1 → Y1 and X2 → Y2.
Junction tree:  (X1,Y1) —[X1]— (X1,X2) —[X2]— (X2,Y2)
Example: Initialization

Variable   Associated cluster   Potential function
X1         (X1,Y1)              ψ(X1,Y1) = P(X1)
Y1         (X1,Y1)              ψ(X1,Y1) = P(X1) P(Y1 | X1)
X2         (X1,X2)              ψ(X1,X2) = P(X2 | X1)
Y2         (X2,Y2)              ψ(X2,Y2) = P(Y2 | X2)
Example: Collect Evidence
• Choose an arbitrary root clique, e.g. (X1,X2), into which all potential functions will be collected.
• Recursively ask neighboring cliques for messages:
• 1. Call (X1,Y1):
  – Projection:  φ(X1) = Σ_{Y1} ψ(X1,Y1) = P(X1)
  – Absorption:  ψ(X1,X2) ← ψ(X1,X2) · φ(X1) / φ_old(X1) = P(X2 | X1) P(X1) = P(X1,X2)
Example: Collect Evidence (cont.)
• 2. Call (X2,Y2):
  – Projection:  φ(X2) = Σ_{Y2} ψ(X2,Y2) = Σ_{Y2} P(Y2 | X2) = 1
  – Absorption:  ψ(X1,X2) ← ψ(X1,X2) · φ(X2) / φ_old(X2) = P(X1,X2)
Example: Distribute Evidence
• Pass messages recursively from the root to neighboring cliques.
• Pass message from (X1,X2) to (X1,Y1):
  – Projection:  φ(X1) = Σ_{X2} ψ(X1,X2) = P(X1)
  – Absorption:  ψ(X1,Y1) ← ψ(X1,Y1) · φ(X1) / φ_old(X1) = P(X1,Y1) · P(X1) / P(X1) = P(X1,Y1)
Example: Distribute Evidence (cont.)
• Pass message from (X1,X2) to (X2,Y2):
  – Projection:  φ(X2) = Σ_{X1} ψ(X1,X2) = P(X2)
  – Absorption:  ψ(X2,Y2) ← ψ(X2,Y2) · φ(X2) / φ_old(X2) = P(Y2 | X2) · P(X2) / 1 = P(Y2,X2)
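The collect/distribute steps above can be traced concretely for the 2-step HMM junction tree (the CPT values below are illustrative): after distribution, the leaf clique holds the joint P(X2, Y2).

```python
# CPTs over binary variables (illustrative values).
p_x1 = [0.6, 0.4]
p_x2_given_x1 = [[0.7, 0.3], [0.2, 0.8]]   # p_x2_given_x1[x1][x2]
p_y_given_x = [[0.9, 0.1], [0.25, 0.75]]   # p_y_given_x[x][y]

# Initialization of the clique potentials.
psi_x1y1 = [[p_x1[x1] * p_y_given_x[x1][y1] for y1 in (0, 1)] for x1 in (0, 1)]
psi_x1x2 = [[p_x2_given_x1[x1][x2] for x2 in (0, 1)] for x1 in (0, 1)]
psi_x2y2 = [[p_y_given_x[x2][y2] for y2 in (0, 1)] for x2 in (0, 1)]

# Collect into (X1,X2): project each sepset, absorb into the root clique.
phi_x1 = [sum(psi_x1y1[x1]) for x1 in (0, 1)]                  # = P(X1)
psi_x1x2 = [[psi_x1x2[x1][x2] * phi_x1[x1] for x2 in (0, 1)] for x1 in (0, 1)]
phi_x2 = [sum(psi_x2y2[x2]) for x2 in (0, 1)]                  # = 1
psi_x1x2 = [[psi_x1x2[x1][x2] * phi_x2[x2] for x2 in (0, 1)] for x1 in (0, 1)]

# Distribute to (X2,Y2): after absorption the leaf clique holds P(X2, Y2).
phi_x2_new = [sum(psi_x1x2[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]  # = P(X2)
psi_x2y2 = [[psi_x2y2[x2][y2] * phi_x2_new[x2] / phi_x2[x2] for y2 in (0, 1)]
            for x2 in (0, 1)]
```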
Example: Inference with Evidence
• Assume we want to compute P(X2 | Y1=0, Y2=1) (state estimation).
• Assign likelihoods to the potential functions during initialization:
  ψ(X1,Y1) = 0 if Y1 = 1;  P(X1, Y1=0) if Y1 = 0
  ψ(X2,Y2) = 0 if Y2 = 0;  P(Y2=1 | X2) if Y2 = 1
Example: Inference with Evidence (cont.)
• Repeating the same steps as in the previous case, we obtain:
  ψ(X1,Y1) = 0 if Y1 = 1;  P(X1, Y1=0, Y2=1) if Y1 = 0
  φ(X1) = P(X1, Y1=0, Y2=1)
  ψ(X1,X2) = P(X1, Y1=0, X2, Y2=1)
  φ(X2) = P(Y1=0, X2, Y2=1)
  ψ(X2,Y2) = 0 if Y2 = 0;  P(Y1=0, X2, Y2=1) if Y2 = 1
Variable Elimination
General idea:
• Write the query in the form

P(X_n, e) = \sum_{x \setminus \{x_n\}} \prod_i P(x_i | pa_i, e)

• Iteratively:
  – Move all irrelevant terms outside of the innermost sum
  – Perform the innermost sum, getting a new term
  – Insert the new term into the product
Complexity of Variable Elimination
• Suppose in one elimination step we compute

f'_x(y_1, ..., y_k) = \sum_x f_x(x, y_1, ..., y_k)

f_x(x, y_1, ..., y_k) = \prod_{i=1}^{m} f_i(x, y_{i,1}, ..., y_{i,l_i})

This requires:
• m \cdot |Val(X)| \cdot \prod_i |Val(Y_i)| multiplications
• |Val(X)| \cdot \prod_i |Val(Y_i)| additions

Complexity is exponential in the number of variables in the intermediate factor.
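The elimination steps can be made concrete on a tiny chain A → B → C (a sketch with illustrative CPTs, not from the slides): summing out A produces the intermediate factor f1(B), which is then used to sum out B.

```python
# Variable elimination on the chain A -> B -> C (binary variables), computing P(C).
p_a = [0.3, 0.7]
p_b_given_a = [[0.8, 0.2], [0.4, 0.6]]   # p_b_given_a[a][b]
p_c_given_b = [[0.5, 0.5], [0.1, 0.9]]   # p_c_given_b[b][c]

# Eliminate A: P(C|B) is irrelevant to the sum over A, so it moves outside;
# summing out A yields a new factor f1(B) = P(B).
f1 = [sum(p_a[a] * p_b_given_a[a][b] for a in (0, 1)) for b in (0, 1)]

# Eliminate B: sum out B against the new factor, yielding P(C).
p_c = [sum(f1[b] * p_c_given_b[b][c] for b in (0, 1)) for c in (0, 1)]
```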
Chordal Graphs
• An elimination ordering induces an undirected chordal graph:
  – Maximal cliques are factors in elimination
  – Factors in elimination are cliques in the graph
  – Complexity is exponential in the size of the largest clique in the graph

[Figure: a DAG over V, S, T, L, A, B, X, D and the chordal undirected graph induced by elimination.]
Induced Width
• The size of the largest clique in the induced graph is thus an indicator of the complexity of variable elimination
• This quantity is called the induced width of the graph according to the specified ordering
• Finding a good ordering for a graph is equivalent to finding an ordering that achieves the minimal induced width of the graph
Properties of Junction Trees
• In every (consistent) junction tree:
  – For each cluster (or sepset) V:  φ(V) = P(V)
  – The probability distribution of any variable X can be computed from any cluster (or sepset) V that contains X:

P(X) = \sum_{V \setminus \{X\}} φ(V)
Exact Inference Using Junction Trees
• A junction tree is an undirected tree in which each node is a cluster of variables
• Running intersection property: given two clusters X and Y, all clusters on the path between X and Y contain X ∩ Y
• Separator sets (sepsets): the intersection of adjacent clusters

[Figure: clusters ABD, ADE, DEF with sepsets AD and DE; e.g. cluster ABD, sepset DE.]
Constructing Junction Trees: Marrying Parents
[Figure: DAG over X1, ..., X6; parents with a common child are connected ("married").]
Moral Graph
[Figure: the resulting moral graph over X1, ..., X6 with edge directions dropped.]
Triangulation
[Figure: the moral graph triangulated by adding chords.]
Identify Cliques
[Figure: cliques of the triangulated graph: X1X2X3, X2X3X5, X2X5X6, X2X4.]
Junction Tree
• A junction tree is a subgraph of the clique graph satisfying the running intersection property:

X1X2X3 —[X2X3]— X2X3X5 —[X2X5]— X2X5X6, with X2X4 attached via sepset X2
Constructing Junction Trees
DAG → Moral Graph → (Triangulation) → Triangulated Graph → (Identify Cliques) → Junction Tree
Sequential Approach (Example)
• Lower bound for the medical diagnosis example, using

f(\sum_j a_j z_j) \ge \sum_j q_j f(a_j z_j / q_j)

with the concave function f(x) = \ln(1 - e^{-x}):

P(f_i = 1 | d) = 1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j} \ge \prod_j [1 - e^{-\theta_{i0} - \theta_{ij} d_j / q_{j|i}}]^{q_{j|i}}

where q_{j|i} is a probability distribution over the parents j of f_i.
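The lower bound above can be checked numerically (a sketch with illustrative parameters, not from the slides): for any distribution q over the parents, the product form stays below the exact probability.

```python
import math

def p_f1(d, theta0, theta):
    """Exact noisy-OR probability P(f_i = 1 | d)."""
    return 1.0 - math.exp(-theta0 - sum(t * dj for t, dj in zip(theta, d)))

def lower_bound(d, theta0, theta, q):
    """prod_j [1 - exp(-theta0 - theta_j * d_j / q_j)]^{q_j}, with sum_j q_j = 1."""
    out = 1.0
    for t, dj, qj in zip(theta, d, q):
        out *= (1.0 - math.exp(-theta0 - t * dj / qj)) ** qj
    return out

d = (1, 0, 1)
theta0 = 0.05
theta = [0.5, 1.2, 0.3]
q = [0.5, 0.2, 0.3]          # variational distribution over the parents
exact = p_f1(d, theta0, theta)
bound = lower_bound(d, theta0, theta, q)
```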