An Introduction to Bayesian Machine Learning for Multimedia Information Processing
Part I - Introduction
A. Taylan Cemgil
Signal Processing and Communications Lab.
2008 IEEE ICME Tutorial, 23 June, Hannover
Goals of this Tutorial
• Provide a basic understanding of the underlying principles of probabilistic modeling and Bayesian inference
• Orientation in the broad literature of Bayesian machine learning and statistical signal processing
• Focus on fundamental concepts rather than technical details,
. . . we avoid heavy use of algebra by using a graphical notation
. . . but there will be some maths
Goals of this Tutorial
• Model based approach
. . . rather than description of algorithms for solving specific problems
• Illustrate with examples how certain problems in multimedia signal analysis can be approached using generic tools
• Motivate participants to investigate further
. . . provide alternative perspective to existing solutions
. . . and hopefully provide new inspiration
Part I, Introduction
• Introduction
– Bayes’ Theorem
– Trivial toy example to clarify notation
• Graphical Models
– Bayesian Networks
– Undirected Graphical models, Markov Random Fields
– Factor graphs
• Maximum Likelihood, Penalised Likelihood, Bayesian Learning
Part II, Basic Modelling and Inference Strategies
• Basic Building Blocks in model construction
– Probability distributions, Exponential family
• Approximate Inference
– Stochastic
∗ Markov Chain Monte Carlo (MCMC), Gibbs sampler
∗ Simulated Annealing, Iterative Improvement (SA - II)
– Deterministic Inference
∗ Variational Bayes (VB)
∗ Expectation-Maximisation (EM)
∗ Iterative conditional modes (ICM)
Part III, Models and Applications
• Hidden Markov Models
– Tempo tracking, Score-performance matching
– Inference in Hidden Markov Models
∗ Forward Backward Algorithm
∗ Viterbi
∗ Exact inference by message passing: Belief Propagation
• Linear Dynamical systems, Kalman Filter Models
– Tracking
– Computer Accompaniment
– Kalman Filtering and Smoothing
– Audio Restoration and Interpolation⋆
• Nonlinear Dynamical Systems
– Object tracking in video
– Importance sampling, Particle Filtering, Sequential Monte Carlo
– Switching State Space models, Changepoint Models
– Pitch tracking
• Markov Random Fields
– Denoising, Source Separation
• Topic-Term-Document Models
– Latent Semantic indexing
– Generative aspect model
– Non Negative Matrix Factorisation
• Factorial Models, Model selection
– Audio Source Separation
– Polyphonic Pitch Tracking
• Final Remarks and Bibliography
Bayes’ Theorem [1, 3]
Thomas Bayes (1702-1761)
What you know about a parameter λ after the data D arrive is what you knew before about λ and what the data D told you.
p(λ|D) = p(D|λ) p(λ) / p(D)

Posterior = Likelihood × Prior / Evidence
An application of Bayes’ Theorem: “Source Separation”
Given two fair dice with outcomes λ and y,
D = λ + y
What is λ when D = 9 ?
An application of Bayes’ Theorem: “Source Separation”
D = λ + y = 9
D = λ + y   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1         2       3       4       5       6       7
λ = 2         3       4       5       6       7       8
λ = 3         4       5       6       7       8       9
λ = 4         5       6       7       8       9      10
λ = 5         6       7       8       9      10      11
λ = 6         7       8       9      10      11      12
Bayes’ theorem “upgrades” p(λ) into p(λ|D).
But you have to provide an observation model: p(D|λ)
“Bureaucratic” derivation
Formally we write
p(λ) = C(λ; [ 1/6 1/6 1/6 1/6 1/6 1/6 ])
p(y) = C(y; [ 1/6 1/6 1/6 1/6 1/6 1/6 ])
p(D|λ, y) = δ(D − (λ + y))
p(λ, y|D) = (1 / p(D)) × p(D|λ, y) × p(λ) p(y)

Posterior = (1 / Evidence) × Likelihood × Prior

Here δ(x) is the Kronecker delta, denoting a degenerate (deterministic) distribution: δ(x) = 1 if x = 0 and δ(x) = 0 if x ≠ 0.
Prior
p(y)p(λ)
p(y) × p(λ)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1          1/36    1/36    1/36    1/36    1/36    1/36
λ = 2          1/36    1/36    1/36    1/36    1/36    1/36
λ = 3          1/36    1/36    1/36    1/36    1/36    1/36
λ = 4          1/36    1/36    1/36    1/36    1/36    1/36
λ = 5          1/36    1/36    1/36    1/36    1/36    1/36
λ = 6          1/36    1/36    1/36    1/36    1/36    1/36
• A table with indices λ and y
• Each cell denotes the probability p(λ, y)
Likelihood
p(D = 9|λ, y)
p(D = 9|λ, y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1             0       0       0       0       0       0
λ = 2             0       0       0       0       0       0
λ = 3             0       0       0       0       0       1
λ = 4             0       0       0       0       1       0
λ = 5             0       0       0       1       0       0
λ = 6             0       0       1       0       0       0
• A table with indices λ and y
• The likelihood is not a probability distribution, but a positive function.
Likelihood × Prior
φD(λ, y) = p(D = 9|λ, y)p(λ)p(y)
φD(λ, y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1        0       0       0       0       0       0
λ = 2        0       0       0       0       0       0
λ = 3        0       0       0       0       0       1/36
λ = 4        0       0       0       0       1/36    0
λ = 5        0       0       0       1/36    0       0
λ = 6        0       0       1/36    0       0       0
Evidence (= Marginal Likelihood)
p(D = 9) = Σ_{λ,y} p(D = 9|λ, y) p(λ) p(y)
         = 0 + 0 + · · · + 1/36 + 1/36 + 1/36 + 1/36 + 0 + · · · + 0
         = 1/9

p(D = 9|λ, y) p(λ) p(y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1                       0       0       0       0       0       0
λ = 2                       0       0       0       0       0       0
λ = 3                       0       0       0       0       0       1/36
λ = 4                       0       0       0       0       1/36    0
λ = 5                       0       0       0       1/36    0       0
λ = 6                       0       0       1/36    0       0       0
Posterior
p(λ, y|D = 9) = (1 / p(D)) p(D = 9|λ, y) p(λ) p(y)
p(λ, y|D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1             0       0       0       0       0       0
λ = 2             0       0       0       0       0       0
λ = 3             0       0       0       0       0       1/4
λ = 4             0       0       0       0       1/4     0
λ = 5             0       0       0       1/4     0       0
λ = 6             0       0       1/4     0       0       0

1/4 = (1/36)/(1/9)
Marginal Posterior
p(λ|D) = Σ_y (1 / p(D)) p(D|λ, y) p(λ) p(y)
p(λ, y|D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6   p(λ|D = 9)
λ = 1             0       0       0       0       0       0        0
λ = 2             0       0       0       0       0       0        0
λ = 3             0       0       0       0       0       1/4      1/4
λ = 4             0       0       0       0       1/4     0        1/4
λ = 5             0       0       0       1/4     0       0        1/4
λ = 6             0       0       1/4     0       0       0        1/4
The “proportional to” notation
p(λ|D = 9) ∝ p(λ, D = 9) = Σ_y p(D = 9|λ, y) p(λ) p(y)
p(λ, D = 9, y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6   p(λ, D = 9)
λ = 1              0       0       0       0       0       0        0
λ = 2              0       0       0       0       0       0        0
λ = 3              0       0       0       0       0       1/36     1/36
λ = 4              0       0       0       0       1/36    0        1/36
λ = 5              0       0       0       1/36    0       0        1/36
λ = 6              0       0       1/36    0       0       0        1/36
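All of the tables in this example can be reproduced numerically. A minimal Octave/Matlab sketch (our own variable names, not part of the original slides):

% Two fair dice, observe D = lambda + y = 9
p_lambda = ones(1,6)/6;                  % prior p(lambda)
p_y      = ones(1,6)/6;                  % prior p(y)
prior    = p_lambda' * p_y;              % 6x6 table of p(lambda) p(y)
lik      = zeros(6,6);                   % likelihood p(D = 9 | lambda, y)
for lambda = 1:6
  for y = 1:6
    lik(lambda, y) = (lambda + y == 9);
  end
end
phi       = lik .* prior;                % likelihood x prior
evidence  = sum(phi(:))                  % p(D = 9) = 1/9
posterior = phi / evidence;              % p(lambda, y | D = 9)
marginal  = sum(posterior, 2)'           % p(lambda | D = 9) = [0 0 1/4 1/4 1/4 1/4]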
Another application of Bayes’ Theorem: “Model Selection”
Given an unknown number of fair dice with outcomes λ1, λ2, . . . , λn,
D = λ1 + λ2 + · · · + λn
How many dice are there when D = 9 ?
Assume that any number n is equally likely
Another application of Bayes’ Theorem: “Model Selection”
Given all n are equally likely (i.e., p(n) is flat), we calculate (formally)
p(n|D = 9) = p(D = 9|n) p(n) / p(D) ∝ p(D = 9|n)

p(D|n = 1) = Σ_{λ1} p(D|λ1) p(λ1)

p(D|n = 2) = Σ_{λ1} Σ_{λ2} p(D|λ1, λ2) p(λ1) p(λ2)

. . .

p(D|n = n′) = Σ_{λ1,…,λn′} p(D|λ1, . . . , λn′) Π_{i=1}^{n′} p(λi)
p(D|n) = Σ_λ p(D|λ, n) p(λ|n)

(Figure: p(D|n) plotted against D = 1, . . . , 20 for n = 1, . . . , 5 dice.)
Another application of Bayes’ Theorem: “Model Selection”
(Figure: the posterior p(n|D = 9) over the number of dice n = 1, . . . , 9; most of the probability mass falls on n = 2 and n = 3.)
• Complex models are more flexible but they spread their probability mass
• Bayesian inference inherently prefers “simpler models” – Occam’s razor
• Computational burden: We need to sum over all parameters λ
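The figure above can be reproduced by brute force. A small Octave/Matlab sketch (our own code, assuming a flat prior on n): p(D|n) is the n-fold convolution of the single-die distribution, and p(n|D = 9) follows by normalisation.

die  = ones(1,6)/6;               % p(lambda_i) for a single fair die
nmax = 9;                         % D = 9 cannot be produced by more than 9 dice
pD9  = zeros(1, nmax);            % will hold p(D = 9 | n)
pDn  = die;                       % p(D | n = 1), support 1..6
for n = 1:nmax
  if n > 1
    pDn = conv(pDn, die);         % p(D | n), support n..6n
  end
  support = n:6*n;
  idx = find(support == 9);
  if ~isempty(idx)
    pD9(n) = pDn(idx);            % pick out p(D = 9 | n)
  end
end
post_n = pD9 / sum(pD9)           % p(n | D = 9) under a flat prior on n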
Probabilistic Inference
A huge spectrum of applications – all boil down to computation of
• expectations of functions under probability distributions: Integration
〈f(x)〉 = ∫_X dx p(x) f(x)        〈f(x)〉 = Σ_{x ∈ X} p(x) f(x)

• modes of functions under probability distributions: Optimization

x∗ = argmax_{x ∈ X} p(x) f(x)

• any “mix” of the above: e.g.,

x∗ = argmax_{x ∈ X} p(x) = argmax_{x ∈ X} ∫_Z dz p(z) p(x|z)
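For discrete distributions both operations are one-liners in Octave/Matlab; a small illustrative example (the distribution and f are made up):

p = [0.1 0.2 0.4 0.3];          % p(x) for x = 1, 2, 3, 4
f = [1 4 9 16];                 % f(x) = x^2
Ef = sum(p .* f)                % expectation <f(x)> : integration (here a finite sum)
[val, xstar] = max(p .* f)      % mode of p(x) f(x)  : optimization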
Divide and Conquer
Probabilistic modelling provides a methodology that puts a clear division between

• What to solve : Model Construction

– Both an Art and a Science
– Highly domain specific

• How to solve : Inference Algorithm

– Mechanical (in theory! not in practice)
– Generic
Exercise
p(x1, x2)   x2 = 1   x2 = 2
x1 = 1       0.3      0.3
x1 = 2       0.1      0.3

1. Find the following quantities

• Marginals: p(x1), p(x2)
• Conditionals: p(x1|x2), p(x2|x1)
• Posterior: p(x1, x2 = 2), p(x1|x2 = 2)
• Evidence: p(x2 = 2)
• p({})
• Max: p(x∗1) = max_{x1} p(x1|x2 = 1)
• Mode: x∗1 = argmax_{x1} p(x1|x2 = 1)
• Max-marginal: max_{x1} p(x1, x2)
2. Are x1 and x2 independent ? (i.e., Is p(x1, x2) = p(x1)p(x2) ?)
Answers
p(x1, x2)   x2 = 1   x2 = 2
x1 = 1       0.3      0.3
x1 = 2       0.1      0.3

• Marginals:

p(x1)
x1 = 1   0.6
x1 = 2   0.4

p(x2)   x2 = 1   x2 = 2
         0.4      0.6

• Conditionals:

p(x1|x2)   x2 = 1   x2 = 2
x1 = 1      0.75     0.5
x1 = 2      0.25     0.5

p(x2|x1)   x2 = 1   x2 = 2
x1 = 1      0.5      0.5
x1 = 2      0.25     0.75
Answers
p(x1, x2)   x2 = 1   x2 = 2
x1 = 1       0.3      0.3
x1 = 2       0.1      0.3

• Posterior:

p(x1, x2 = 2)            p(x1|x2 = 2)
x1 = 1   0.3             x1 = 1   0.5
x1 = 2   0.3             x1 = 2   0.5

• Evidence: p(x2 = 2) = Σ_{x1} p(x1, x2 = 2) = 0.6

• Normalisation constant: p({}) = Σ_{x1} Σ_{x2} p(x1, x2) = 1
Answers
p(x1, x2)   x2 = 1   x2 = 2
x1 = 1       0.3      0.3
x1 = 2       0.1      0.3

• Max (get the value): max_{x1} p(x1|x2 = 1) = 0.75

• Mode (get the index): argmax_{x1} p(x1|x2 = 1) = 1

• Max-marginal (get the “skyline”): max_{x1} p(x1, x2)

max_{x1} p(x1, x2)   x2 = 1   x2 = 2
                      0.3      0.3
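These answers are easy to check numerically. A short Octave/Matlab verification (our own code):

P = [0.3 0.3; 0.1 0.3];                  % joint p(x1, x2); rows: x1, columns: x2
p_x1 = sum(P, 2)'                        % marginal p(x1) = [0.6 0.4]
p_x2 = sum(P, 1)                         % marginal p(x2) = [0.4 0.6]
P_x1_given_x2 = P ./ repmat(p_x2, 2, 1)  % p(x1|x2); each column sums to 1
P_x2_given_x1 = P ./ repmat(p_x1', 1, 2) % p(x2|x1); each row sums to 1
mm  = max(P, [], 1)                      % max-marginal over x1 = [0.3 0.3]
dep = norm(P - p_x1' * p_x2)             % nonzero, so x1 and x2 are NOT independent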
Exercise: Continuous Random variables
(Figure: the two joint densities p(x, c = 1) and p(x, c = 2) plotted as functions of x on [−0.5, 1].)

• Evaluate p(c), p(x = 0) and p(c|x = 10)
Exercise: Continuous Random variables
• x is a conditionally Gaussian random variable and c is discrete

p(x|c) = N(x; µ(c), v(c)) ≡ (1/√(2π v(c))) exp(−(x − µ(c))^2 / (2 v(c)))

• In this example we take

p(x, c = 1) = 0.6 N(x; 0, 0.01)        p(x, c = 2) = 0.4 N(x; 0.2, 0.03)
Solutions
• p(x = 0) is a number, which we calculate here with Matlab (or Octave)

>> x = 0; mu = [0, 0.2]; v = [0.01 0.03]; pc = [0.6 0.4];
>> pc .* (2*pi*v).^(-1/2) .* exp(-0.5*(x-mu).^2./v)
ans =
    2.3937    0.4730
>> sum(ans)
ans =
    2.8667

• Note: this works here, but adding exp’s can be numerically unstable

• How come the “probability” is larger than one?

– p(x) is a density. The probability is p(x)dx
Solution
(Figure: p(x, c = 1), p(x, c = 2) and their sum, the marginal density p(x), plotted over [−0.5, 1].)
Solutions
• p(c|x = 10) is a distribution,

p(c|x = 10) = p(c, x = 10)/p(x = 10)

• We calculate here with Matlab (or Octave)

>> x = 10; mu = [0, 0.2]; v = [0.01 0.03]; pc = [0.6 0.4];
>> pcx = pc .* (2*pi*v).^(-1/2) .* exp(-0.5*(x-mu).^2./v)
pcx =
     0     0
>> pcx/sum(pcx)
Warning: Divide by zero.
ans =
   NaN   NaN

• Problem: underflow. We ALWAYS work with log-densities in practice.
Solutions
(Figure: the joint densities p(x, c = 1) and p(x, c = 2) on a logarithmic scale, down to about 10^−20, plotted over [−0.5, 1].)
Solutions
• The log-density is
log p(x|c) = log N(x; µ(c), v(c)) ≡ −(1/2) log(2π v(c)) − (x − µ(c))^2 / (2 v(c))

>> lpc = log([0.6 0.4]);
>> lpcx = lpc - 1/2*log(2*pi*v) - 0.5*(x-mu).^2./v
lpcx =
   1.0e+003 *
   -4.9991   -1.6007
>> lpcx - log(sum(exp(lpcx)))
Warning: Log of zero.
ans =
   Inf   Inf

• The problem still persists (we exponentiate very small numbers)
Numerically Stable Computation of log(Σ_i exp(l_i))
• Derivation

L = log(Σ_i exp(l_i))
  = log(Σ_i exp(l_i) exp(l∗) / exp(l∗))
  = log(exp(l∗) Σ_i exp(l_i − l∗))
  = l∗ + log(Σ_i exp(l_i − l∗))
• We take l∗ as the maximum l∗ = maxi li
• Exercise: Implement above as a function logsumexp(l)
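One possible Octave/Matlab solution to the exercise (a sketch; the guard for an all −Inf input is our own addition):

function L = logsumexp(l)
% LOGSUMEXP  Numerically stable evaluation of log(sum(exp(l))) for a vector l
  lstar = max(l);                          % shift by the largest element
  if ~isfinite(lstar)                      % e.g. all entries are -Inf
    L = lstar;
    return;
  end
  L = lstar + log(sum(exp(l - lstar)));
end

For example, logsumexp([-1000 -1001]) returns about -999.69, whereas the naive log(sum(exp([-1000 -1001]))) underflows to -Inf.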
Solutions
>> lpc = log([0.6 0.4]);
>> lpcx = lpc - 1/2*log(2*pi*v) - 0.5*(x-mu).^2./v
lpcx =
   1.0e+003 *
   -4.9991   -1.6007
>> lpcx - logsumexp(lpcx)
ans =
   1.0e+003 *
   -3.3984         0
>> exp(ans)
ans =
     0     1
Probability Models + Inference Algorithms = Bayesian Numerical Methods
Applications of Probability Models
• Classification
• Optimal Decision, given a loss function
• Finding interesting (hidden) structure
– Clustering, Segmentation
– Dimensionality Reduction
– Outlier Detection
• Finding a compact representation = Data Compression
• Prediction
Graphical Models
• formal languages for specification of probability models and associated inference algorithms

• historically, introduced in probabilistic expert systems (Pearl 1988) as a visual guide for representing expert knowledge

• today, a standard tool in machine learning, statistics and signal processing
Graphical Models
• provide graph based algorithms for derivations and computation
• pedagogical insight/motivation for model/algorithm construction
– Statistics: “Kalman filter models and hidden Markov models (HMM) are equivalent up to parametrisation”

– Signal processing: “The fast Fourier transform is an instance of the sum-product algorithm on a factor graph”

– Computer Science: “Backtracking in Prolog is equivalent to inference in Bayesian networks with deterministic tables”

• Automated tools for code generation are starting to emerge, making the design/implement/test cycle shorter
Important types of Graphical Models
• Useful for Model Construction
– Directed Acyclic Graphs (DAG), Bayesian Networks
– Undirected Graphs, Markov Networks, Random Fields
– Influence diagrams
– ...
• Useful for Inference
– Factor Graphs
– Junction/Clique graphs
– Region graphs
– ...
Directed Graphical models (DAG)
DAG Example: Two dice
(Graph: λ and y are parents of D, with node factors p(λ), p(y) and p(D|λ, y).)
p(D, λ, y) = p(D|λ, y)p(λ)p(y)
DAG with observations
(Graph: the same DAG with D observed, so the likelihood term becomes p(D = 9|λ, y).)
φD(λ, y) = p(D = 9|λ, y)p(λ)p(y)
Directed Graphical models
• Each random variable is associated with a node in the graph,
• We draw an arrow A → B if A appears in the conditional p(B| . . . , A, . . . ), i.e. A ∈ parent(B),

• The edges tell us qualitatively about the factorization of the joint probability

• For N random variables x1, . . . , xN, the distribution admits the factorization

p(x1, . . . , xN) = Π_{i=1}^{N} p(xi|parent(xi))

• Describes in a compact way an algorithm to “generate” the data – “Generative models”
Examples
Model        Factorization (structure diagrams omitted)
Full         p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x1, x2, x3)
Markov(2)    p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x2, x3)
Markov(1)    p(x1) p(x2|x1) p(x3|x2) p(x4|x3)
–            p(x1) p(x2|x1) p(x3|x1) p(x4)
Factorized   p(x1) p(x2) p(x3) p(x4)
Removing edges eliminates a term from the conditional probability factors.
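The factorization also doubles as a simulation recipe: sample each variable given its parents, in topological order (ancestral sampling). A minimal Octave/Matlab sketch for the Markov(1) chain with binary states (the transition table is made up for illustration):

p1 = [0.5 0.5];                    % p(x1) over states {1, 2}
A  = [0.9 0.1; 0.2 0.8];           % A(i, j) = p(x_t = j | x_{t-1} = i)
N  = 4;
x  = zeros(1, N);
x(1) = find(rand < cumsum(p1), 1);              % sample x1 ~ p(x1)
for t = 2:N
  x(t) = find(rand < cumsum(A(x(t-1), :)), 1);  % sample x_t ~ p(x_t | x_{t-1})
end
x                                  % one draw from p(x1) p(x2|x1) p(x3|x2) p(x4|x3)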
Undirected Graphical Models
Undirected Graphical Models
• Define a distribution by non-negative local compatibility functions φ(xα)
p(x) = (1/Z) Π_α φ(x_α)

where α runs over cliques: fully connected subsets

• Examples (graphs omitted): a four-cycle with edges x1–x2, x1–x3, x2–x4, x3–x4, and a graph with cliques {x1, x2, x3} and {x2, x3, x4}

p(x) = (1/Z) φ(x1, x2) φ(x1, x3) φ(x2, x4) φ(x3, x4)        p(x) = (1/Z) φ(x1, x2, x3) φ(x2, x3, x4)
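The normalisation constant Z is what makes undirected models expensive: it is a sum over all joint configurations. For small discrete models it can be computed by brute force; a sketch in Octave/Matlab for the pairwise four-cycle above, with binary variables and an arbitrary compatibility table (our own choice):

phi = [2 1; 1 2];          % phi(x_i, x_j) for x in {1, 2}; favours equal neighbours
Z = 0;
for x1 = 1:2
  for x2 = 1:2
    for x3 = 1:2
      for x4 = 1:2
        Z = Z + phi(x1,x2) * phi(x1,x3) * phi(x2,x4) * phi(x3,x4);
      end
    end
  end
end
Z                          % p(x) = phi(x1,x2) phi(x1,x3) phi(x2,x4) phi(x3,x4) / Z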
Possible Model Topologies
Factor graphs
Factor graphs [2]
• A bipartite graph. A powerful graphical representation of the inference problem
– Factor nodes : black squares. Factor potentials (local functions) defining the posterior.
– Variable nodes : white nodes. Define collections of random variables.
– Edges : denote membership. A variable node is connected to a factor node if a member variable is an argument of the local function.

(Factor graph for the two-dice example: variable nodes λ and y; factor nodes for p(λ), p(y) and p(D = 9|λ, y).)
φD(λ, y) = p(D = 9|λ, y)p(λ)p(y) = φ1(λ, y)φ2(λ)φ3(y)
Exercise
• For the following graphical models, write down the factors of the joint distribution and plot an equivalent factor graph and an undirected graph (diagrams omitted here):

– Full: x1 x2 x3 x4
– Markov(1): x1 x2 x3 x4
– HMM: h1 h2 h3 h4 (hidden), x1 x2 x3 x4 (observed)
– MIX: h (hidden), x1 x2 x3 x4 (observed)
– IFA: h1 h2 (hidden), x1 x2 x3 x4 (observed)
– Factorized: x1 x2 x3 x4
Answer (Markov(1))
The joint distribution factorizes as

p(x1) p(x2|x1) p(x3|x2) p(x4|x3)

(Factor graph: the chain x1 – x2 – x3 – x4 with one factor node for each of p(x1), p(x2|x1), p(x3|x2), p(x4|x3).)

(Undirected graph: the same chain with pairwise potentials
φ(x1, x2) = p(x1) p(x2|x1),   φ(x2, x3) = p(x3|x2),   φ(x3, x4) = p(x4|x3).)
Answer (IFA – Factorial)

p(h1) p(h2) Π_{i=1}^{4} p(xi|h1, h2)

(Factor graph: hidden nodes h1 and h2; each observed node xi is attached to a factor p(xi|h1, h2) that also has h1 and h2 as arguments.)
Answer (IFA – Factorial)
(Graph: hidden nodes h1 and h2, each connected to all of x1, x2, x3, x4.)

• We can also cluster nodes together

(Graph: the compound node (h1, h2) connected to each of x1, x2, x3, x4.)
Inference and Learning
• Data set

D = {x1, . . . , xN}

• Model with parameter λ

p(D|λ)

• Maximum Likelihood (ML)

λML = argmax_λ log p(D|λ)

• Predictive distribution

p(xN+1|D) ≈ p(xN+1|λML)
Regularisation
• Prior

p(λ)

• Maximum a-posteriori (MAP) : Regularised Maximum Likelihood

λMAP = argmax_λ log p(D|λ) p(λ)

• Predictive distribution

p(xN+1|D) ≈ p(xN+1|λMAP)
Bayesian Learning
• We treat parameters on the same footing as all other variables
• We integrate over unknown parameters rather than using point estimates (remember the many-dice example)

– Self-regularisation, avoids overfitting
– Natural setup for online adaptation
– Model selection
Bayesian Learning
• Predictive distribution
p(xN+1|D) = ∫ dλ p(xN+1|λ) p(λ|D)

(Graph: the parameter λ is a common parent of x1, x2, . . . , xN and of the new point xN+1.)
• Bayesian learning is just inference ...
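A concrete (and deliberately simple) illustration of the ML / MAP / Bayesian distinction, not from the slides: estimating the bias λ of a coin from N tosses with a Beta(a, b) prior, for which all three predictive distributions have closed forms. An Octave/Matlab sketch:

D = [1 1 0 1 1];                          % observed tosses, 1 = heads (made-up data)
N = length(D);  k = sum(D);               % k heads out of N tosses
a = 2;  b = 2;                            % Beta(a, b) prior on lambda

lam_ML  = k / N;                          % maximum likelihood estimate
lam_MAP = (k + a - 1) / (N + a + b - 2);  % MAP = regularised ML
p_ML    = lam_ML;                         % plug-in predictive p(x_{N+1} = 1 | lambda_ML)
p_MAP   = lam_MAP;                        % plug-in predictive p(x_{N+1} = 1 | lambda_MAP)
p_Bayes = (k + a) / (N + a + b);          % Bayesian predictive: lambda integrated out
[p_ML  p_MAP  p_Bayes]                    % 0.8000  0.7143  0.6667

The Bayesian predictive is pulled towards the prior, which is exactly the self-regularisation mentioned above.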
Example Applications and Models
Audio Restoration
• During download or transmission, some samples of audio are lost
• Estimate missing samples given clean ones
Examples: Audio Restoration
p(x¬κ|xκ) ∝ ∫ dH p(x¬κ|H) p(xκ|H) p(H)

H ≡ (parameters, hidden states)

(Graph: H is a parent of both the missing samples x¬κ and the observed samples xκ.)
Restoration
(Cemgil and Godsill 2005)

• Piano

– Signal with missing samples (37%)
– Reconstruction, 7.68 dB improvement
– Original

• Trumpet

– Signal with missing samples (37%)
– Reconstruction, 7.10 dB improvement
– Original
(Embedded audio examples: piano_missing.wav, piano_kalman.wav, piano_clean.wav, trumpet_missing.wav, trumpet_kalman.wav, trumpet_clean.wav)
Interpolation of Images
(Graph: the same model as before, with H a parent of the missing pixels x¬κ and the observed pixels xκ.)
Interpolation of Images
(Figure panels: “Data (25% Missing)”, “Variational Bayes+ICM NMF”, “ML NMF2”.)
Medical Expert Systems
(Graph: causes A and S at the top; diseases T, L, B and E in the middle; symptoms X and D at the bottom. The node names are spelled out on the next slide.)
Medical Expert Systems
(Graph: the same network with the nodes spelled out: “Visit to Asia?” and “Smoking?” at the top; “Tuberculosis?”, “Lung Cancer?” and “Bronchitis?”; “Either T or L?”; “Positive X Ray?” and “Dyspnoea?” at the bottom.)
Medical Expert Systems
Marginal probabilities before any evidence is entered:

Node               state 0    state 1
Visit to Asia?      99%         1%
Smoking?            50%        50%
Tuberculosis?       99%         1%
Lung Cancer?        94.5%       5.5%
Bronchitis?         55%        45%
Either T or L?      93.5%       6.5%
Positive X Ray?     89%        11%
Dyspnoea?           56.4%      43.6%
Medical Expert Systems
Posterior probabilities after entering a positive X-ray as evidence:

Node               state 0    state 1
Visit to Asia?      98.7%       1.3%
Smoking?            31.2%      68.8%
Tuberculosis?       90.8%       9.2%
Lung Cancer?        51.1%      48.9%
Bronchitis?         49.4%      50.6%
Either T or L?      42.4%      57.6%
Positive X Ray?      0%       100%
Dyspnoea?           35.9%      64.1%
Medical Expert Systems
Posterior probabilities after additionally entering that the patient is a non-smoker:

Node               state 0    state 1
Visit to Asia?      98.5%       1.5%
Smoking?           100%         0%
Tuberculosis?       85.2%      14.8%
Lung Cancer?        85.8%      14.2%
Bronchitis?         70%        30%
Either T or L?      71.1%      28.9%
Positive X Ray?      0%       100%
Dyspnoea?           56%        44%
Model Selection: Variable Selection in Polynomial Regression
• Given D = {tj, x(tj)}j=1...J , what is the order N of the polynomial?
x(t) = s1 + s2 t + s3 t^2 + s4 t^3 + · · · + ε(t)

(Figure: noisy observations {tj, x(tj)} plotted over t ∈ [−1, 1].)
Bayesian Variable Selection
(Graph: indicator variables r1, . . . , rW with priors C(rw; π); coefficients s1, . . . , sW with sw ∼ N(sw; µ(rw), Σ(rw)); the observation x ∼ N(x; C s1:W, R).)

• Generalized Linear Model – the columns of C are the basis vectors
• The exact posterior is a mixture of 2^W Gaussians
• When W is large, computation of posterior features becomes intractable.
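To make the intractability concrete: for each on/off configuration r the marginal likelihood p(x|r) of a linear-Gaussian model is itself Gaussian, and the posterior over configurations follows by enumerating all 2^W of them. A brute-force Octave/Matlab sketch (our own toy data and parameter values; it reuses the logsumexp function from the earlier exercise and assumes a flat prior over r):

t = linspace(-1, 1, 20)';                 % inputs
x = 0.3 + 0.5*t.^2 + 0.05*randn(20, 1);   % made-up noisy observations
W = 4;                                    % basis functions 1, t, t^2, t^3
C = zeros(length(t), W);
for j = 1:W, C(:, j) = t.^(j-1); end      % design matrix

sv = 1;                                   % prior variance of each active coefficient
R  = 0.01 * eye(length(t));               % observation noise covariance
logev = zeros(1, 2^W);
for k = 0:2^W - 1
  r  = bitget(k, 1:W);                    % on/off indicator vector
  Cr = C(:, r == 1);
  S  = Cr * (sv * eye(sum(r))) * Cr' + R; % covariance of p(x | r)
  logev(k+1) = -0.5 * (x' * (S \ x) + log(det(S)) + length(t) * log(2*pi));
end
post_r = exp(logev - logsumexp(logev))    % p(r | x) over all 2^W configurations

For W in the dozens this enumeration is hopeless (2^W terms), which is what motivates the approximate inference methods of Part II.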
Regression
(Figure: p(x, r1:W) evaluated for each on/off configuration, ordered from “All on” to “All off”.)
Regression
(Figure: the data, the true function and the approximation plotted together; legend: data, true, approx.)
Clustering
Clustering
π : label probability
c1, c2, . . . , cN : labels ∈ {a, b}
x1, x2, . . . , xN : data points
µa, µb : cluster centers

(µ∗a, µ∗b, π∗) = argmax_{µa, µb, π} Σ_{c1:N} Π_{i=1}^{N} p(xi|µa, µb, ci) p(ci|π)
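One standard way to attack this maximisation is the Expectation-Maximisation algorithm discussed in Part II. A compact Octave/Matlab sketch for one-dimensional data with two unit-variance clusters (our own toy data and initialisation):

x   = [randn(1, 50) - 2, randn(1, 50) + 2];   % toy data from two clusters
mu  = [-1 1];                                 % initial cluster centers (mu_a, mu_b)
ppi = [0.5 0.5];                              % initial label probabilities (ppi avoids shadowing pi)
n   = length(x);
for it = 1:50
  % E-step: responsibilities p(c_i | x_i, mu, pi) for unit-variance Gaussians
  lik = exp(-0.5 * (repmat(x', 1, 2) - repmat(mu, n, 1)).^2);
  g   = lik .* repmat(ppi, n, 1);
  g   = g ./ repmat(sum(g, 2), 1, 2);
  % M-step: re-estimate cluster centers and label probabilities
  mu  = ((g' * x') ./ sum(g, 1)')';
  ppi = mean(g, 1);
end
mu, ppi                                       % estimated centers and mixing weights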
Computer vision / Cognitive Science
How many rectangles are there in this image?
Computer Vision
How many people are there in these images?
Visual Tracking
(Figure: four video frames from the tracking example.)
Navigation, Robotics : Sensor Fusion
Navigation, Robotics : Sensor Fusion
GPS?t : GPS status
Gt : GPS reading
. . . : other sensors (magnetic, pressure, etc.)
lt : linear acceleration sensor
ωt : gyroscope
Et−1, Et : attitude variables
Xt−1, Xt : linear kinematic variables
{ξ1:Nt}t : set of feature points (camera frame)
{x1:Mt}t : set of feature points (world coordinates)
ρ(x) : global static map (intensity function)
Computer Accompaniment
(Music Plus One, Raphael 2000; Dannenberg and Raphael 2006)
(Graph: chains c0, . . . , cK and s0, . . . , sK; observation sequences y1, . . . , yK and ya1, . . . , yaK; accompaniment variables a0, . . . , aK.)
References
[1] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, edited by G. L. Bretthorst, 2003.

[2] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, February 2001.
[3] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.