LEARNING AND INFERENCE IN GRAPHICAL MODELS
Chapter 10: Random Fields
Dr. Martin Lauer
University of Freiburg, Machine Learning Lab
Karlsruhe Institute of Technology, Institute of Measurement and Control Systems
References for this chapter
◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 8, Springer, 2006
◮ Michael Ying Yang and Wolfgang Förstner, A hierarchical conditional random field model for labeling and classifying images of man-made scenes. In: IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 196–203, 2011
Motivation
Bayesian networks model clear dependencies, often causal dependencies. Bayesian networks are acyclic.
How can we model mutual and cyclic dependencies?
Example (economy):
◮ demand and supply determine the price
◮ high price fosters supply
◮ low price fosters demand
Motivation
Example (physics): modeling ferromagnetism in statistical mechanics
◮ a grid of magnetic dipoles in a volume
◮ every dipole causes a force on its neighbors
◮ every dipole is forced by its neighbors
The dipoles might change their orientation. Every configuration of the magnetic dipole field can be characterized by its energy. The probability of a certain configuration depends on its energy: high-energy configurations are less probable, low-energy configurations are more probable.
→ Ising model (Ernst Ising, 1924)
Markov random fields
◮ a Markov random field (MRF) is an undirected, connected graph
◮ each node represents a random variable
• open circles indicate non-observed random variables
• filled circles indicate observed random variables
• dots indicate given constants
◮ links indicate an explicitly modeled stochastic dependence
[Figure: example MRF with nodes A, B, C, D]
Markov random fields
The joint probability distribution of an MRF is defined over cliques in the graph.
Definition: A clique of size k is a subset C of k nodes of the MRF such that each pair X, Y ∈ C with X ≠ Y is connected by an edge.
Example: The MRF on the right has
◮ one clique of size 3: {X_2, X_3, X_4}
◮ four cliques of size 2: {X_1, X_2}, {X_2, X_3}, {X_2, X_4}, {X_3, X_4}
◮ four cliques of size 1: {X_1}, {X_2}, {X_3}, {X_4}
[Figure: example MRF with nodes X_1, X_2, X_3, X_4 and edges X_1–X_2, X_2–X_3, X_2–X_4, X_3–X_4]
Markov random fields
For every clique C in the MRF we specify a potential function

ψ_C : C → ℝ_{>0}

◮ large values of ψ_C indicate that a certain configuration of the random variables in the clique is more probable
◮ small values of ψ_C indicate that a certain configuration of the random variables in the clique is less probable
The joint distribution of the MRF is defined as the product of the potential functions of all cliques:

p(X_1, . . . , X_n) = (1/Z) · ∏_{C ∈ Cliques} ψ_C(C)

with Z = ∫ ∏_{C ∈ Cliques} ψ_C(C) d(X_1, . . . , X_n) the partition function
Remark: calculating Z might be very hard in practice
Markov random fields
Potential functions are usually given in terms of Gibbs/Boltzmann distributions
ψ_C(C) = e^{−E_C(C)}

with E_C : C → ℝ an “energy function”
◮ large energy means low probability
◮ small energy means large probability
Hence, the overall probability distribution of an MRF is
p(X_1, . . . , X_n) = (1/Z) · e^{−∑_{C ∈ Cliques} E_C(C)}
Markov random fields
Example: let us model the food preferences of a group of four persons: Antonia, Ben, Charles, and Damaris. They might choose between pasta, fish, and meat.
◮ Ben likes meat and pasta but hates fish
◮ Antonia, Ben, and Charles prefer to choose the same
◮ Charles is vegetarian
◮ Damaris prefers to choose something different from all the others
→ create an MRF on the blackboard that models the food preferences of the four persons and assign potential functions to the cliques.
Markov random fields
One way to model the food preference task
Random variables A, B, C, D model Antonia's, Ben's, Charles', and Damaris' choices. They are discrete variables with values 1 = pasta, 2 = fish, 3 = meat.
Energy functions that are relevant (all others are constant):
[Figure: MRF with nodes A, B, C, D and edges A–B, A–C, B–C, A–D, B–D, C–D]
E_{B}(b) = 0 if b ∈ {1, 3}; 100 if b = 2

E_{A,B,C}(a, b, c) = 0 if a = b = c; 30 otherwise

E_{C}(c) = 0 if c = 1; 50 if c = 2; 200 if c = 3

E_{A,D}(a, d) = 0 if a ≠ d; 10 if a = d

E_{B,D}(b, d) = 0 if b ≠ d; 10 if b = d

E_{C,D}(c, d) = 0 if c ≠ d; 10 if c = d
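As an illustration (a minimal sketch in Python, not part of the original slides; all names are made up for this example), these energy functions can be written down directly and the unnormalized joint probability e^{−∑E} of a configuration can be evaluated by brute force:

```python
import math
from itertools import product

# Food-preference MRF: values 1 = pasta, 2 = fish, 3 = meat
def E_B(b):         return 0 if b in (1, 3) else 100
def E_C(c):         return {1: 0, 2: 50, 3: 200}[c]
def E_ABC(a, b, c): return 0 if a == b == c else 30
def E_notD(x, d):   return 0 if x != d else 10   # used for the cliques {A,D}, {B,D}, {C,D}

def total_energy(a, b, c, d):
    return (E_B(b) + E_C(c) + E_ABC(a, b, c)
            + E_notD(a, d) + E_notD(b, d) + E_notD(c, d))

def joint_unnormalized(a, b, c, d):
    # p(a, b, c, d) is proportional to exp(-total energy)
    return math.exp(-total_energy(a, b, c, d))

# Partition function Z by brute force over all 3^4 = 81 configurations
Z = sum(joint_unnormalized(*cfg) for cfg in product((1, 2, 3), repeat=4))
```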
Factor graphs
As for Bayesian networks, we can define factor graphs over MRFs.
A factor graph is a bipartite graph with two kinds of nodes:
◮ variable nodes that model random variables
◮ factor nodes that model a probabilistic relationship between variable nodes. Each factor node is assigned a potential function
Variable nodes and factor nodes are connected by undirected links. For each MRF we can create a factor graph as follows:
◮ the set of variable nodes is taken from the nodes of the MRF
◮ for each non-constant potential function ψ_C
• we create a new factor node f
• we connect f with all variable nodes in the clique C
• we assign the potential function ψ_C to f
Hence, the joint probability of the MRF equals the Gibbs distribution over the sum of all factor energies.
Factor graphs
The factor graph of the food preference task looks like this:
[Figure: factor graph with variable nodes A, B, C, D and factor nodes for E_{B}, E_{C}, E_{A,B,C}, E_{A,D}, E_{B,D}, E_{C,D}]
Stochastic inference in Markov random fields
How can we calculate p(U = u | O = o) and argmax_u p(U = u | O = o)?
◮ if the factor graph related to an MRF is a tree, we can use the sum-product and max-sum algorithms introduced in chapter 4.
◮ in the general case there are no efficient exact algorithms
◮ we can build variational approximations (chapter 6) for approximate inference
◮ we can use MCMC samplers (chapter 7) for numerical inference
◮ we can use local optimization (chapter 8)
Example: in the food preference task (a brute-force sketch follows below),
◮ what is the overall best choice of food?
◮ what is the best choice of food if Antonia eats fish?
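Since the model has only 3^4 = 81 configurations, both questions can be answered by exhaustive enumeration. A minimal sketch (my own illustration, reusing the hypothetical total_energy function from the earlier sketch; not from the slides):

```python
from itertools import product

# reuses total_energy(a, b, c, d) from the earlier sketch;
# the most probable configuration is the one with minimal energy
configs = list(product((1, 2, 3), repeat=4))

best_overall = min(configs, key=lambda cfg: total_energy(*cfg))

# condition on "Antonia eats fish" (A = 2) by restricting the enumeration
best_given_A_fish = min((cfg for cfg in configs if cfg[0] == 2),
                        key=lambda cfg: total_energy(*cfg))

print("overall MAP:", best_overall)
print("MAP given A = fish:", best_given_A_fish)
```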
Special types of MRFs
MRFs are very general and can be used for many purposes. Some models have been shown to be very useful. In this lecture, we introduce
◮ the Potts model. Useful for image segmentation and noise removal
◮ Conditional random fields. Useful for image segmentation
◮ the Boltzmann machine. Useful for unsupervised and supervised learning
◮ Markov logic networks. Useful for logic inference on noisy data (chapter 11)
Potts Model
Potts model
The Potts model can be used for segmentation and noise removal in images and other sensor data. We discuss it for the image segmentation case.
Assume,
◮ an image is composed of several areas (e.g. foreground/background, object A/object B/background)
◮ each area has a characteristic color or gray value
◮ pixels in the image are corrupted by noise
◮ neighboring pixels are very likely to belong to the same area
How can we model these assumptions with an MRF?
Potts model
◮ every pixel belongs to a certain area. We model it with a discrete random variable X_{i,j}. The true class label is unobserved.
◮ the color/gray value of each pixel is described by a random variable Y_{i,j}. The color value is observed.
◮ X_{i,j} and Y_{i,j} are stochastically dependent. This dependency can be described by an energy function
◮ the class labels of neighboring pixels are stochastically dependent. This can be described by an energy function.
◮ we can provide priors for the class labels as energy functions on the individual X_{i,j}
[Figure: grid-structured MRF in which each label node X_{i,j} is linked to its four neighboring label nodes and to its observation node Y_{i,j}]
Potts model
energy functions on cliques:
◮ similarity of neighboring nodes
E_{X_{i,j},X_{i+1,j}}(x_{i,j}, x_{i+1,j}) = 0 if x_{i,j} = x_{i+1,j}; 1 if x_{i,j} ≠ x_{i+1,j}

E_{X_{i,j},X_{i,j+1}}(x_{i,j}, x_{i,j+1}) = 0 if x_{i,j} = x_{i,j+1}; 1 if x_{i,j} ≠ x_{i,j+1}
◮ dependency between observed color/gray value and class label. Assume each class k can be characterized by a typical color/gray value c_k:

E_{X_{i,j},Y_{i,j}}(x_{i,j}, y_{i,j}) = ‖y_{i,j} − c_{x_{i,j}}‖

◮ overall preference for certain classes. Assume a prior distribution p over the classes:

E_{X_{i,j}}(x_{i,j}) = − log p(x_{i,j})
Potts model
energy function for the whole Potts model:

E = κ · ∑_{i,j} E_{X_{i,j},Y_{i,j}}(x_{i,j}, y_{i,j})
  + λ · ∑_{i,j} E_{X_{i,j},X_{i+1,j}}(x_{i,j}, x_{i+1,j})
  + λ · ∑_{i,j} E_{X_{i,j},X_{i,j+1}}(x_{i,j}, x_{i,j+1})
  + µ · ∑_{i,j} E_{X_{i,j}}(x_{i,j})

with weighting factors κ, λ, µ ≥ 0
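As a concrete illustration (a minimal sketch under my own assumptions, not part of the slides), the total energy can be evaluated for a grayscale observation y and a label image x with hypothetical reference values c[k]:

```python
import numpy as np

def potts_energy(x, y, c, kappa=1.0, lam=1.0, mu=1.0, prior=None):
    """Total Potts energy of a label image x given a grayscale observation y.

    x     : (H, W) integer array of class labels
    y     : (H, W) float array of observed gray values
    c     : (K,) array with one typical gray value c_k per class
    prior : optional (K,) array of prior class probabilities
    """
    # data term: kappa * sum ||y_ij - c_{x_ij}||
    data = np.sum(np.abs(y - c[x]))
    # smoothness term: lambda * (number of disagreeing horizontal/vertical neighbor pairs)
    smooth = np.sum(x[1:, :] != x[:-1, :]) + np.sum(x[:, 1:] != x[:, :-1])
    # prior term: mu * sum (-log p(x_ij))
    prior_term = 0.0 if prior is None else np.sum(-np.log(prior[x]))
    return kappa * data + lam * smooth + mu * prior_term
```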
Potts model for image segmentation
Let us apply the Potts model to image segmentation as described before
Determining a segmentation is done by maximizing the conditional probability p(. . . , X_{i,j}, . . . | . . . , Y_{i,j}, . . . ), where the Y_{i,j} are the color/gray values of a given picture. This is equivalent to minimizing the overall energy while keeping the Y_{i,j} values fixed.
Solution techniques:
◮ finding an exact solution is NP-hard in general; in the two-class case it takes O(n³) if n is the number of pixels (solution using graph cuts)
◮ local optimization
◮ MCMC sampling → Matlab-demo (a Gibbs-sampling sketch follows below)
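To make the MCMC option concrete, here is a rough sketch (my own illustration for the grayscale case with made-up names; not the Matlab demo itself) of one Gibbs-sampling sweep over the label image, keeping the observed values y fixed:

```python
import numpy as np

def gibbs_sweep(x, y, c, kappa=1.0, lam=1.0, rng=np.random.default_rng()):
    """One Gibbs-sampling sweep over the label image x (updated in place)."""
    H, W = x.shape
    K = len(c)
    for i in range(H):
        for j in range(W):
            energies = np.empty(K)
            for k in range(K):
                # data term for label k at pixel (i, j)
                e = kappa * abs(y[i, j] - c[k])
                # smoothness term: disagreement with the 4-neighborhood
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        e += lam * (k != x[ni, nj])
                energies[k] = e
            # sample the new label from its full conditional p(x_ij | rest, y)
            p = np.exp(-(energies - energies.min()))
            x[i, j] = rng.choice(K, p=p / p.sum())
    return x
```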
Think about extensions of the Potts model that can cope with cases in which the reference colors of the segments are a priori vague or unknown → homework
Conditional Random Fields
Segmentation with Potts model revisited
Using a Potts model for segmentation requires adequate energy functions E_{X_{i,j},Y_{i,j}}
◮ easy for a color segmentation task with pre-specified segment colors
◮ possible for a color segmentation task with roughly pre-specified segment colors
◮ almost impossible for texture-based segmentation
Task: segment the picture into areas of road, buildings, vegetation, sky, cars.
Idea: combine random-field-based segmentation with traditional classifiers (e.g. neural networks, support vector machines, decision trees, etc.)
◮ apply classifier on small patches of the image
◮ use a random field to integrate neighborhood relationships
Combination of random fields and classifiers
A classifier is
◮ a mapping from a vector of observations (features) to class labels
◮ a mapping from a vector of observations (features) to class probabilities
With the second definition, the classifier provides a distribution p(X|Y) with X the class label and Y the observation vector. A classifier provides neither a distribution over Y nor over X alone.
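A common way to plug such a classifier into a random field (a small sketch under my own assumptions, not from the slides) is to turn its class probabilities into unary energies E = −log p(X|Y):

```python
import numpy as np

def unary_energies(class_probs, eps=1e-12):
    """Convert classifier outputs p(X_ij = k | Y_ij) into unary energies.

    class_probs : (H, W, K) array of per-pixel class probabilities
    returns     : (H, W, K) array of energies -log p, so that exp(-E)
                  recovers the classifier's class probabilities
    """
    return -np.log(np.clip(class_probs, eps, 1.0))
```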
Combination of random fields and classifiers
Let us try to build a Potts model integrating the classifier to model p(X|Y)
◮ we can model the prior on the class labels as before using a potential function
◮ we can model the relationship between neighboring X nodes by a potential function as before
◮ we can model p(X_{i,j}|Y_{i,j}) with the classifier
[Figure: grid-structured random field with label nodes X_{i,j} and observation nodes Y_{i,j}, as on the Potts model slide]
What does the joint distribution p({X_{i,j}, Y_{i,j}}) over all (i, j) look like?
The joint distribution is not fully specified since we do not know p({Y_{i,j}})
Conditional random fields
Conditional random fields (CRFs) overcome the problem of the missing p({Y_{i,j}}) by modeling only p({X_{i,j}} | {Y_{i,j}}). This is sufficient if we do not want to make inference on {Y_{i,j}} but only on {X_{i,j}}.
A conditional random field consists of
◮ a set of observed nodes O
◮ a set of unobserved random variables U
◮ edges between pairs of unobserved nodes
◮ edges between observed and unobserved nodes
Note that cliques in a conditional random field contain at most one observed node.
[Figure: example CRF with nodes A, B, C, D, E]
Conditional random fields
For every clique that contains at least one unobserved node we specify a potential function ψ_C : C → ℝ_{>0}.
A CRF specifies the conditional distribution p(U|O) as

p(U|O) = (1/Z) · ∏_{C ∈ Cliques} ψ_C(C)
[Figure: example CRF as on the previous slide]
Example: facade segmentation
Segmentation of pictures into categories building/car/door/pavement/road/sky/vegetation/window. Work of Michael Ying Yang
Approach: Hierarchical CRF combined with random decision forest.
Result: [Figure: facade segmentation results]
cf. Yang and Förstner, 2011
Boltzmann Machines
Boltzmann machines
Definition: A Boltzmann machine is a fully connected MRF with binary random variables. Its energy function is defined over 1-cliques and 2-cliques by:

E_X(x) = −θ_X · x
E_{X,Y}(x, y) = −w_{X,Y} · x · y

with θ_X, w_{X,Y} non-negative real weight factors.
Hence, if we enumerate all random variables with X_1, . . . , X_n:

p(x_1, . . . , x_n) = (1/Z) · exp( ∑_{i=1}^{n} ∑_{j=1}^{i−1} (w_{X_i,X_j} · x_i · x_j) + ∑_{i=1}^{n} (θ_{X_i} · x_i) )
Note that w_{X,X} = 0 and w_{X,Y} = w_{Y,X}.
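As an illustration (a brief sketch with made-up names, not from the slides), the unnormalized probability of a binary configuration can be computed directly from a symmetric weight matrix W and a bias vector θ:

```python
import numpy as np

def bm_unnormalized_prob(x, W, theta):
    """Unnormalized probability of a binary configuration of a Boltzmann machine.

    x     : (n,) vector with entries in {0, 1}
    W     : (n, n) symmetric weight matrix with zero diagonal (w_{X,X} = 0)
    theta : (n,) vector of bias weights
    """
    # 0.5 * x^T W x counts every pair (i, j) with i != j exactly once,
    # because W is symmetric and has a zero diagonal
    exponent = 0.5 * x @ W @ x + theta @ x
    return np.exp(exponent)
```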
Boltzmann machines
What is a Boltzmann machine good for?
Two tasks:
◮ pattern classification
◮ denoising of patterns
Boltzmann machines for pattern classification
Goal: we assume some patterns (data) which belong to different categories. Applying a pattern to the Boltzmann machine, we want the Boltzmann machine to return the appropriate class label.
Structure of a Boltzmann machine for classification
There are three different types of nodes:
◮ observed nodes O. We apply a pattern to the observed nodes by setting their value to the respective value of the pattern and never change it afterwards
◮ label nodes L. These serve as output of the Boltzmann machine. We have one label node for each class. Finally, the label nodes indicate the class probabilities for each class
◮ hidden nodes H. These nodes are unobserved and used for stochastic inference on the pattern
Boltzmann machines for pattern classification
Process of class prediction:
1. we apply a pattern to the observed nodes, i.e. the value of the i-th observed node is set to the i-th value of the pattern. Afterwards, we do not change the observed nodes any more
2. we use Gibbs sampling to update the values of all hidden nodes H and label nodes L, i.e. we try to determine the most probable configurations of p(L,H|O). If we are only interested in the most probable configuration we might also use simulated annealing to find it.
3. after a while we interpret the label nodes. We might assume that the value of the i-th label node is proportional to the posterior probability of the i-th class
Gibbs sampling for Boltzmann machines
To implement Gibbs sampling we need to know p(X_i | X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n). W.l.o.g. we get

p(X_n | X_1, . . . , X_{n−1}) ∝ p(X_n, X_1, . . . , X_{n−1})
∝ exp( ∑_{i=1}^{n} ∑_{j=1}^{i−1} (w_{X_i,X_j} · x_i · x_j) + ∑_{i=1}^{n} (θ_{X_i} · x_i) )
= exp( x_n · ∑_{j=1}^{n−1} (w_{X_n,X_j} · x_j) + θ_{X_n} · x_n + ∑_{i=1}^{n−1} ∑_{j=1}^{i−1} (w_{X_i,X_j} · x_i · x_j) + ∑_{i=1}^{n−1} (θ_{X_i} · x_i) )
= exp( x_n · ∑_{j=1}^{n−1} (w_{X_n,X_j} · x_j) + θ_{X_n} · x_n ) · exp( ∑_{i=1}^{n−1} ∑_{j=1}^{i−1} (w_{X_i,X_j} · x_i · x_j) + ∑_{i=1}^{n−1} (θ_{X_i} · x_i) )
∝ exp( x_n · ∑_{j=1}^{n−1} (w_{X_n,X_j} · x_j) + θ_{X_n} · x_n )

Hence,

p(X_n = 0 | X_1, . . . , X_{n−1}) = (1/Z) · e^0
p(X_n = 1 | X_1, . . . , X_{n−1}) = (1/Z) · exp( ∑_{j=1}^{n−1} (w_{X_n,X_j} · x_j) + θ_{X_n} )

From p(X_n = 0 | X_1, . . . , X_{n−1}) + p(X_n = 1 | X_1, . . . , X_{n−1}) = 1 follows

Z = 1 + exp( ∑_{j=1}^{n−1} (w_{X_n,X_j} · x_j) + θ_{X_n} )
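This conditional is exactly a logistic (sigmoid) function of the weighted input of the resampled unit. A minimal sketch of the resulting Gibbs update in Python (my own illustration with made-up names, not from the slides):

```python
import numpy as np

def gibbs_update(x, W, theta, i, rng=np.random.default_rng()):
    """Resample binary unit i from its full conditional p(x_i = 1 | rest).

    Follows the derivation above: the conditional is a logistic function of
    the weighted input sum_j w_ij x_j + theta_i (with w_ii = 0).
    """
    activation = W[i] @ x + theta[i]          # sum_j w_ij * x_j + theta_i
    p1 = 1.0 / (1.0 + np.exp(-activation))    # sigmoid, equals e^a / (1 + e^a)
    x[i] = 1 if rng.random() < p1 else 0
    return x
```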
Boltzmann machines for denoising
Goal: we assume that all patterns have a typical structure. Applying a pattern, we want the Boltzmann machine to return a typical pattern that is most similar to the applied pattern.
Structure of a Boltzmann machine for denoising
There are two different types of nodes:
◮ observed nodes O. We apply a pattern to the observed nodes by setting their value to the respective value of the pattern (unlike in the classification case, these values will later be updated)
◮ hidden nodes H. These nodes are unobserved and used for stochastic inference on the pattern
Boltzmann machines for denoising
Process of denoising:
1. we apply a pattern to the observed nodes, i.e. the value of the i-th observed node is set to the i-th value of the pattern.
2. we use Gibbs sampling (or simulated annealing) to update the values of all hidden nodes H and observed nodes O, i.e. we try to determine the most probable configurations of p(H,O).
3. after a while we consider the values of the observed nodes as the pattern after denoising
Training of Boltzmann machines
For both tasks, we need to train a Boltzmann machine before we can use it, i.e. determine appropriate parameters w_{X,Y} and θ_X.
Assume we are given T training examples (patterns and labels for the classification task, only patterns for the denoising task). Now, we want to maximize the likelihood w.r.t. w_{X,Y} and θ_X:

∏_{t=1}^{T} p(O^{(t)}, L^{(t)} | {w_{X,Y} | X, Y ∈ O ∪ H ∪ L}, {θ_X | X ∈ O ∪ H ∪ L})
→ gradient ascent (calculating the gradient is not trivial)
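As a side note (a standard result stated here for orientation, not part of the slides): for Boltzmann machines the gradient of the log-likelihood takes the form of a difference of two expectations,

$$\frac{\partial \log p}{\partial w_{X_i,X_j}} = \langle x_i\, x_j \rangle_{\text{data}} - \langle x_i\, x_j \rangle_{\text{model}}, \qquad \frac{\partial \log p}{\partial \theta_{X_i}} = \langle x_i \rangle_{\text{data}} - \langle x_i \rangle_{\text{model}},$$

where the first expectation is taken with the observed (and label) nodes clamped to the training data and the second over the free-running model. Both expectations are typically estimated by Gibbs sampling, which is one reason why training is slow.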
Boltzmann machines
Some remarks on Boltzmann machines:
◮ training Boltzmann machines is very time-consuming
◮ however, there are more efficient variants (restricted Boltzmann machines, deep belief networks) which are the subject of recent research and which are better suited for pattern recognition and machine learning
◮ we do not want to discuss Boltzmann machines in depth in this lecture since they have already been discussed in Prof. Sperschneider's machine learning lecture
Summary
◮ definition of Markov random fields
• joint probability distribution
• factor graph
◮ Potts model
• image segmentation example
◮ Conditional random fields
• image segmentation example of Michael Ying Yang
◮ Boltzmann machines