PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning...

30
Overview Why learning from data possible? Chow-Liu Tree Algorithm PGM 4 — Learning structures Qinfeng (Javen) Shi ML session 11 Qinfeng (Javen) Shi PGM 4 — Learning structures

Transcript of PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning...

Page 1: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

PGM 4 — Learning structures

Qinfeng (Javen) Shi

ML session 11

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 2: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Table of Contents I

1 OverviewManually specify a graphUse simple rulesLearn from data

2 Why learning from data possible?

3 Chow-Liu Tree AlgorithmMutual info and KL divergenceMinimise KL divergenceAlgorithm

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 3: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Manually specify a graphUse simple rulesLearn from data

Overview

Given graph, we can learn the parameters, and do inference.

How to get the graph at the first place?

Manually specify a graph (based on domain knowledge, orexperience)

Use simple rules

Learn from data

Learn from labels (Chow-Liu Tree Algorithm (1968))Learn from both labels and features.

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 4: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Manually specify a graphUse simple rulesLearn from data

Overview

Given graph, we can learn the parameters, and do inference.

How to get the graph at the first place?

Manually specify a graph (based on domain knowledge, orexperience)

Use simple rules

Learn from data

Learn from labels (Chow-Liu Tree Algorithm (1968))Learn from both labels and features.

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 5: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Manually specify a graphUse simple rulesLearn from data

Overview

Given graph, we can learn the parameters, and do inference.

How to get the graph at the first place?

Manually specify a graph (based on domain knowledge, orexperience)

Use simple rules

Learn from data

Learn from labels (Chow-Liu Tree Algorithm (1968))Learn from both labels and features.

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 6: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Manually specify a graphUse simple rulesLearn from data

Manually specify a graph

Image denoising1

Applications in Vision and PR

Image denoising

Original CorrectedNoisy

Denoising

Real Applications

),( ii yxΦ

),( ji xxΨ

X ∗ = argmaxX P(X |Y )

1This example is from Tiberio Caetano’s short course: “Machine Learningusing Graphical Models”

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 7: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Manually specify a graphUse simple rulesLearn from data

Manually specify a graph

Pose estimation 2

2Left pic from http://cs.brown.edu/~ls/research.html; mid and rightpics from http://ieeexplore.ieee.org/ieee_pilot/articles/

96jproc10/96jproc10-wu/article.html

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 8: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Manually specify a graphUse simple rulesLearn from data

Manually specify a graph

Pose estimation 2

2Left pic from http://cs.brown.edu/~ls/research.html; mid and rightpics from http://ieeexplore.ieee.org/ieee_pilot/articles/

96jproc10/96jproc10-wu/article.html

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 9: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Manually specify a graphUse simple rulesLearn from data

Use simple rules

Image segmentation and object recognition (Gould&He, comm. acm 14)

Rules: each small segment (called super-pixel) is a node; if two segments

(nodes) are adjacent, add an edge between them.

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 10: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Manually specify a graphUse simple rulesLearn from data

Use simple rules

Scene understanding on the street (driverless cars)

(a) original image (b) graph via super-pixel adjacency

(c) graph via distance mst (minimal

spanning tree)

Show KITTI dataset if internet works

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 11: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Manually specify a graphUse simple rulesLearn from data

Use simple rules

Tracking (Zhang&van der Maaten CVPR13)

Rule: each bounding box is a node, and find distance mst (minimal spanningtree) among all nodes.

Show their demo if internet worksQinfeng (Javen) Shi PGM 4 — Learning structures

Page 12: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Manually specify a graphUse simple rulesLearn from data

Learn from labels

Multiple labels classification (Tan etal CVPR15)

(d) Images of “plane” in PASCAL2007 database

plane

bike bird

boat

bottle

bus

car

cat

chair

table cow

dog

horse motor

person

plant

sheep

sofa

train

tv

(e) Learn from labels

plane bike bird boat bottle

bus

car

cat

chair

table

cow dog

horse

motor person

plant

sheep

sofa

train

tv

(f) Learn from labels and fea-tures

Figure : Comparison of graphs learned from PASCAL2007Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 13: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Why learning from data possible?

PGM represents the distribution, and you can estimate thedistribution

PGM carries independencies and dependencies, and you canestimate them too

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 14: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Why learning from data possible?

PGM represents the distribution, and you can estimate thedistribution

PGM carries independencies and dependencies, and you canestimate them too

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 15: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Why learning from data possible?

PGM represents the distribution, and you can estimate thedistribution

PGM carries independencies and dependencies, and you canestimate them too

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 16: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

Chow-Liu Tree Algorithm

The goal: to find a tree structure3 PGM whose distribution isclosest to the underlying distribution.

learn from labels only

proposed for Bayes Net initially (directed edges) in 1968

works for MRFs too with slight modification (undirectededges)

3Not for all trees: for Bayes Net, each node can have at most 1 parent.Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 17: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

Mutual info

Definition (Mutual information)

I (Xi ,Xj) =∑xi ,xj

P(xi , xj) log( P(xi , xj)

P(xi )P(xj)

)Properties:

I (Xi ,Xj) ≥ 0.

I (Xi ,Xj) = 0 if and only if Xi ,Xj are independent. (prove it inAssignment 3).

Intuition: the higher I (Xi ,Xj) is, the more correlated Xi ,Xj are.

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 18: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

Mutual info

Definition (Mutual information)

I (Xi ,Xj) =∑xi ,xj

P(xi , xj) log( P(xi , xj)

P(xi )P(xj)

)Properties:

I (Xi ,Xj) ≥ 0.

I (Xi ,Xj) = 0 if and only if Xi ,Xj are independent. (prove it inAssignment 3).

Intuition: the higher I (Xi ,Xj) is, the more correlated Xi ,Xj are.

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 19: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

KL divergence

Definition ( Kullback-Leibler divergence)

For any distributions P(x) and P ′(x) over x,

KL(P(x)||P ′(x)) =∑x

P(x) logP(x)

P ′(x)

Properties:

KL(P(x)||P ′(x)) ≥ 0.

KL(P(x)||P ′(x)) = 0 if and only if P(x),P ′(x) are the same.(prove it in Assignment 3).

Intuition: the smaller KL(P(x)||P ′(x)) is, the closer P(x),P ′(x)are.

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 20: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

KL divergence

Definition ( Kullback-Leibler divergence)

For any distributions P(x) and P ′(x) over x,

KL(P(x)||P ′(x)) =∑x

P(x) logP(x)

P ′(x)

Properties:

KL(P(x)||P ′(x)) ≥ 0.

KL(P(x)||P ′(x)) = 0 if and only if P(x),P ′(x) are the same.(prove it in Assignment 3).

Intuition: the smaller KL(P(x)||P ′(x)) is, the closer P(x),P ′(x)are.

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 21: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

Minimise KL divergence

Setting: Let P(x) be the joint distribution of n discrete variablesx1, x2, ..., xn, where x denotes (x1, x2, ..., xn).

Goal: For Bayes Net, we seek a tree structure, whose Pt(x) isclosest to P(x) in the sense of smallest KL(P(x)||Pt(x)) with onecondition: each node in the tree t has at most 1 parent. (showsome examples in the document camera)

minKL(P(x)||Pt(x))

Bayes Net: Pt(x) =∏n

i=1 P(xi |pat(xi )).Note: pat(xi ) (i.e. parent of xi ) is encoded in the tree t.

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 22: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

Minimise KL divergence

KL(P(x)||Pt(x)) =∑x

P(x) logP(x)

Pt(x)

= −∑x

P(x) logPt(x) +∑x

P(x) logP(x)

= −∑x

P(x) log( n∏

i=1

P(xi |pat(xi )))

+∑x

P(x) logP(x)

= −∑x

P(x)n∑

i=1

log(P(xi , pat(xi ))

P(pat(xi ))

)+∑x

P(x) logP(x)

= −∑x

P(x)n∑

i=1

log(P(xi , pat(xi ))P(xi )

P(xi )P(pat(xi ))

)+∑x

P(x) logP(x)

= −∑x

P(x)n∑

i=1

log( P(xi , pat(xi ))

P(xi )P(pat(xi ))

)−∑x

P(x)n∑

i=1

logP(xi )︸ ︷︷ ︸:=const1

+∑x

P(x) logP(x)︸ ︷︷ ︸:=const2

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 23: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

Minimise KL divergence

= −∑x

P(x)n∑

i=1

log( P(xi , pat(xi ))

P(xi )P(pat(xi ))

)−∑x

P(x)n∑

i=1

logP(xi )︸ ︷︷ ︸:=const1

+∑x

P(x) logP(x)︸ ︷︷ ︸:=const2

= −n∑

i=1

[∑x

P(x) log( P(xi , pat(xi ))

P(xi )P(pat(xi ))

)]− const1 + const2

= −n∑

i=1

[ ∑xi ,pat (xi )

P(xi , pat(xi )) log( P(xi , pat(xi ))

P(xi )P(pat(xi ))

)︸ ︷︷ ︸

I (Xi ,pat (Xi ))

]− const1 + const2

= −n∑

i=1

I (Xi , pat(Xi ))− const1 + const2

So to minimise KL div is to maximise mutual info,

argmint∈T

KL(P(x)||Pt(x)) = argmaxt∈T

n∑i=1

I (Xi , pat(Xi )).

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 24: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

Algorithm

Steps (both Bayes Nets and MRFs):

1 compute all pairwise I (Xi ,Xj)

2 find the maximal spanning tree w.r.t. mutual info

3 decide the direction of the edges for Bayes Nets (this step isnot needed for MRFs)

How to do step 2?Sort I (Xi ,Xj) from highest to lowest. Keep adding edges in thatorder, and skip if adding it would cause a loop (i.e. not a tree anymore).How many edges to add?n − 1 many edges (for n many nodes/variables).Example in document camera

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 25: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

Algorithm

Steps (both Bayes Nets and MRFs):

1 compute all pairwise I (Xi ,Xj)

2 find the maximal spanning tree w.r.t. mutual info

3 decide the direction of the edges for Bayes Nets (this step isnot needed for MRFs)

How to do step 2?

Sort I (Xi ,Xj) from highest to lowest. Keep adding edges in thatorder, and skip if adding it would cause a loop (i.e. not a tree anymore).How many edges to add?n − 1 many edges (for n many nodes/variables).Example in document camera

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 26: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

Algorithm

Steps (both Bayes Nets and MRFs):

1 compute all pairwise I (Xi ,Xj)

2 find the maximal spanning tree w.r.t. mutual info

3 decide the direction of the edges for Bayes Nets (this step isnot needed for MRFs)

How to do step 2?Sort I (Xi ,Xj) from highest to lowest. Keep adding edges in thatorder, and skip if adding it would cause a loop (i.e. not a tree anymore).

How many edges to add?n − 1 many edges (for n many nodes/variables).Example in document camera

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 27: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

Algorithm

Steps (both Bayes Nets and MRFs):

1 compute all pairwise I (Xi ,Xj)

2 find the maximal spanning tree w.r.t. mutual info

3 decide the direction of the edges for Bayes Nets (this step isnot needed for MRFs)

How to do step 2?Sort I (Xi ,Xj) from highest to lowest. Keep adding edges in thatorder, and skip if adding it would cause a loop (i.e. not a tree anymore).How many edges to add?

n − 1 many edges (for n many nodes/variables).Example in document camera

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 28: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

Algorithm

Steps (both Bayes Nets and MRFs):

1 compute all pairwise I (Xi ,Xj)

2 find the maximal spanning tree w.r.t. mutual info

3 decide the direction of the edges for Bayes Nets (this step isnot needed for MRFs)

How to do step 2?Sort I (Xi ,Xj) from highest to lowest. Keep adding edges in thatorder, and skip if adding it would cause a loop (i.e. not a tree anymore).How many edges to add?n − 1 many edges (for n many nodes/variables).

Example in document camera

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 29: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

Algorithm

Steps (both Bayes Nets and MRFs):

1 compute all pairwise I (Xi ,Xj)

2 find the maximal spanning tree w.r.t. mutual info

3 decide the direction of the edges for Bayes Nets (this step isnot needed for MRFs)

How to do step 2?Sort I (Xi ,Xj) from highest to lowest. Keep adding edges in thatorder, and skip if adding it would cause a loop (i.e. not a tree anymore).How many edges to add?n − 1 many edges (for n many nodes/variables).Example in document camera

Qinfeng (Javen) Shi PGM 4 — Learning structures

Page 30: PGM 4 | Learning structuresjaven/talk/ML11_PGM4_Learning... · 2019-02-12 · Overview Why learning from data possible? Chow-Liu Tree Algorithm Table of ContentsI 1 Overview Manually

OverviewWhy learning from data possible?

Chow-Liu Tree Algorithm

Mutual info and KL divergenceMinimise KL divergenceAlgorithm

That’s all

Thanks!

Qinfeng (Javen) Shi PGM 4 — Learning structures