An Introduction to Bayesian Machine Learning for Multimedia Information Processing

  • An Introduction to Bayesian Machine Learning for Multimedia Information Processing

    Part I - Introduction

    A. Taylan Cemgil

    Signal Processing and Communications Lab.

    2008 IEEE ICME Tutorial, 23 June, Hannover


  • Goals of this Tutorial

    • Provide a basic understanding of the underlying principles of probabilistic modelling and Bayesian inference

    • Orientation in the broad literature of Bayesian machine learning and statistical signal processing

    • Focus on fundamental concepts rather than technical details,

    . . . we avoid heavy use of algebra by using a graphical notation

    . . . but there will be some maths


  • Goals of this Tutorial

    • Model based approach

    . . . rather than description of algorithms for solving specific problems

    • Illustrate with examples how certain problems in multimedia signal analysis can be approached using generic tools

    • Motivate participants to investigate further

    . . . provide alternative perspective to existing solutions

    . . . and hopefully provide new inspiration


  • Part I, Introduction

    • Introduction

    – Bayes’ Theorem
    – Trivial toy example to clarify notation

    • Graphical Models

    – Bayesian Networks
    – Undirected Graphical Models, Markov Random Fields
    – Factor graphs

    • Maximum Likelihood, Penalised Likelihood, Bayesian Learning


  • Part II, Basic Modelling and Inference Strategies

    • Basic Building Blocks in model construction

    – Probability distributions, Exponential family

    • Approximate Inference

    – Stochastic
      ∗ Markov Chain Monte Carlo (MCMC), Gibbs sampler
      ∗ Simulated Annealing, Iterative Improvement (SA-II)

    – Deterministic Inference
      ∗ Variational Bayes (VB)
      ∗ Expectation-Maximisation (EM)
      ∗ Iterated Conditional Modes (ICM)


  • Part III, Models and Applications

    • Hidden Markov Models,

    – Tempo tracking, Score-performance matching
    – Inference in Hidden Markov Models
      ∗ Forward-Backward Algorithm
      ∗ Viterbi
      ∗ Exact inference by message passing: Belief Propagation

    • Linear Dynamical systems, Kalman Filter Models

    – Tracking
    – Computer Accompaniment
    – Kalman Filtering and Smoothing
    – Audio Restoration and Interpolation⋆


  • Nonlinear Dynamical Systems

    – Object tracking in video
    – Importance sampling, Particle Filtering, Sequential Monte Carlo
    – Switching State Space Models, Changepoint Models
    – Pitch tracking

    • Markov Random Fields

    – Denoising, Source Separation

    • Topic-Term-Document Models

    – Latent Semantic Indexing
    – Generative aspect model
    – Non-negative Matrix Factorisation


  • Factorial Models, Model selection

    – Audio Source Separation
    – Polyphonic Pitch Tracking

    • Final Remarks and Bibliography


  • Bayes’ Theorem [1, 3]

    Thomas Bayes (1702-1761)

    What you know about a parameter λ after the data D arrive is what you knew before about λ and what the data D told you.

    p(λ|D) = p(D|λ) p(λ) / p(D)

    Posterior = (Likelihood × Prior) / Evidence


  • An application of Bayes’ Theorem: “Source Separation”

    Given two fair dice with outcomes λ and y,

    D = λ + y

    What is λ when D = 9 ?


  • An application of Bayes’ Theorem: “Source Separation”

    D = λ + y = 9

    D = λ + y   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
    λ = 1        2       3       4       5       6       7
    λ = 2        3       4       5       6       7       8
    λ = 3        4       5       6       7       8       9
    λ = 4        5       6       7       8       9      10
    λ = 5        6       7       8       9      10      11
    λ = 6        7       8       9      10      11      12

    Bayes’ theorem “upgrades” p(λ) into p(λ|D).

    But you have to provide an observation model: p(D|λ)


  • “Bureaucratic” derivation

    Formally we write

    p(λ) = C(λ; [ 1/6 1/6 1/6 1/6 1/6 1/6 ])

    p(y) = C(y; [ 1/6 1/6 1/6 1/6 1/6 1/6 ])

    p(D|λ, y) = δ(D − (λ + y))

    p(λ, y|D) = (1/p(D)) × p(D|λ, y) × p(λ) p(y)

    Posterior = (1/Evidence) × Likelihood × Prior

    Here δ is the Kronecker delta, denoting a degenerate (deterministic) distribution: δ(x) = 1 if x = 0, and δ(x) = 0 if x ≠ 0.


  • Prior

    p(y)p(λ)

    p(y) × p(λ)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
    λ = 1         1/36    1/36    1/36    1/36    1/36    1/36
    λ = 2         1/36    1/36    1/36    1/36    1/36    1/36
    λ = 3         1/36    1/36    1/36    1/36    1/36    1/36
    λ = 4         1/36    1/36    1/36    1/36    1/36    1/36
    λ = 5         1/36    1/36    1/36    1/36    1/36    1/36
    λ = 6         1/36    1/36    1/36    1/36    1/36    1/36

    • A table with indices λ and y

    • Each cell denotes the prior probability p(λ, y) = p(λ) p(y)


  • Likelihood

    p(D = 9|λ, y)

    p(D = 9|λ, y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
    λ = 1           0       0       0       0       0       0
    λ = 2           0       0       0       0       0       0
    λ = 3           0       0       0       0       0       1
    λ = 4           0       0       0       0       1       0
    λ = 5           0       0       0       1       0       0
    λ = 6           0       0       1       0       0       0

    • A table with indices λ and y

    • The likelihood is not a probability distribution, but a positive function.


  • Likelihood × Prior

    φD(λ, y) = p(D = 9|λ, y)p(λ)p(y)

    φD(λ, y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
    λ = 1      0       0       0       0       0       0
    λ = 2      0       0       0       0       0       0
    λ = 3      0       0       0       0       0       1/36
    λ = 4      0       0       0       0       1/36    0
    λ = 5      0       0       0       1/36    0       0
    λ = 6      0       0       1/36    0       0       0


  • Evidence (= Marginal Likelihood)

    p(D = 9) = ∑_{λ,y} p(D = 9|λ, y) p(λ) p(y)

             = 0 + 0 + · · · + 1/36 + 1/36 + 1/36 + 1/36 + 0 + · · · + 0

             = 1/9

    φD(λ, y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
    λ = 1      0       0       0       0       0       0
    λ = 2      0       0       0       0       0       0
    λ = 3      0       0       0       0       0       1/36
    λ = 4      0       0       0       0       1/36    0
    λ = 5      0       0       0       1/36    0       0
    λ = 6      0       0       1/36    0       0       0


  • Posterior

    p(λ, y|D = 9) = (1/p(D)) p(D = 9|λ, y) p(λ) p(y)

    p(λ, y|D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
    λ = 1           0       0       0       0       0       0
    λ = 2           0       0       0       0       0       0
    λ = 3           0       0       0       0       0       1/4
    λ = 4           0       0       0       0       1/4     0
    λ = 5           0       0       0       1/4     0       0
    λ = 6           0       0       1/4     0       0       0

    1/4 = (1/36)/(1/9)


  • Marginal Posterior

    p(λ|D) = ∑_y (1/p(D)) p(D|λ, y) p(λ) p(y)

             p(λ|D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
    λ = 1    0            0       0       0       0       0       0
    λ = 2    0            0       0       0       0       0       0
    λ = 3    1/4          0       0       0       0       0       1/4
    λ = 4    1/4          0       0       0       0       1/4     0
    λ = 5    1/4          0       0       0       1/4     0       0
    λ = 6    1/4          0       0       1/4     0       0       0

    (The first column is the marginal p(λ|D = 9), obtained by summing the joint posterior over y.)


  • The “proportional to” notation

    p(λ|D = 9) ∝ p(λ, D = 9) = ∑_y p(D = 9|λ, y) p(λ) p(y)

             p(λ, D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
    λ = 1    0             0       0       0       0       0       0
    λ = 2    0             0       0       0       0       0       0
    λ = 3    1/36          0       0       0       0       0       1/36
    λ = 4    1/36          0       0       0       0       1/36    0
    λ = 5    1/36          0       0       0       1/36    0       0
    λ = 6    1/36          0       0       1/36    0       0       0

    (The first column is the unnormalised marginal p(λ, D = 9), obtained by summing each row over y.)

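    The whole two-dice calculation can be reproduced by brute-force enumeration of the 6 × 6 table. A minimal Octave/Matlab sketch (illustrative only, not part of the original slides):

    % Two fair dice: posterior p(lambda, y | D = 9) by enumerating the 6x6 table
    prior = ones(6,6)/36;              % p(lambda) p(y); rows index lambda, columns index y
    [lam, y] = ndgrid(1:6, 1:6);       % lam(i,j) = i, y(i,j) = j
    lik = (lam + y == 9);              % likelihood table p(D = 9 | lambda, y)
    phi = lik .* prior;                % likelihood x prior
    evidence = sum(phi(:))             % p(D = 9) = 1/9
    posterior = phi / evidence;        % p(lambda, y | D = 9); nonzero entries are 1/4
    marg = sum(posterior, 2)'          % marginal posterior p(lambda | D = 9)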

  • Another application of Bayes’ Theorem: “Model Selection”

    Given an unknown number of fair dice with outcomes λ1, λ2, . . . , λn,

    D = ∑_{i=1}^{n} λi

    How many dice are there when D = 9 ?

    Assume that any number n is equally likely


  • Another application of Bayes’ Theorem: “Model Selection”

    Given all n are equally likely (i.e., p(n) is flat), we calculate (formally)

    p(n|D = 9) = p(D = 9|n) p(n) / p(D) ∝ p(D = 9|n)

    p(D|n = 1) = ∑_{λ1} p(D|λ1) p(λ1)

    p(D|n = 2) = ∑_{λ1} ∑_{λ2} p(D|λ1, λ2) p(λ1) p(λ2)

    . . .

    p(D|n = n′) = ∑_{λ1,...,λn′} p(D|λ1, . . . , λn′) ∏_{i=1}^{n′} p(λi)


  • p(D|n) = ∑_λ p(D|λ, n) p(λ|n)

    [Figure: the marginal likelihoods p(D|n = 1), . . . , p(D|n = 5), plotted against D = 1, . . . , 20 (vertical axes from 0 to 0.2).]

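    The curves p(D|n) can be computed by convolving the single-die pmf with itself n times; the posterior over n then follows from Bayes’ theorem with a flat prior. A small Octave/Matlab sketch of this idea (an assumed implementation, not from the slides):

    % p(D = 9 | n) for n = 1..9 dice by repeated convolution of the die pmf,
    % then p(n | D = 9) under a flat prior p(n)
    die = ones(1,6)/6;                 % pmf of one fair die over the values 1..6
    nmax = 9;                          % more than 9 dice cannot sum to 9
    lik = zeros(1, nmax);
    pmf = 1;                           % pmf of the sum of zero dice (point mass at 0)
    for n = 1:nmax
        pmf = conv(pmf, die);          % pmf of lambda_1 + ... + lambda_n on n..6n
        vals = n:6*n;
        lik(n) = sum(pmf(vals == 9));  % p(D = 9 | n), zero if 9 is out of range
    end
    post = lik / sum(lik)              % p(n | D = 9), up to the flat prior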

  • Another application of Bayes’ Theorem: “Model Selection”

    [Figure: bar plot of the posterior p(n|D = 9) against n = Number of Dice, n = 1, . . . , 9 (vertical axis from 0 to 0.5).]

    • Complex models are more flexible but they spread their probability mass

    • Bayesian inference inherently prefers “simpler models” – Occam’s razor

    • Computational burden: we need to sum over all parameter configurations λ1, . . . , λn


  • Probabilistic Inference

    A huge spectrum of applications – all boil down to computation of

    • expectations of functions under probability distributions: Integration

    〈f(x)〉 = ∫_X dx p(x) f(x)        〈f(x)〉 = ∑_{x∈X} p(x) f(x)

    • modes of functions under probability distributions: Optimization

    x∗ = argmax_{x∈X} p(x) f(x)

    • any “mix” of the above: e.g.,

    x∗ = argmax_{x∈X} p(x) = argmax_{x∈X} ∫_Z dz p(z) p(x|z)

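    For a small discrete X both operations can be carried out by enumeration. A tiny Octave/Matlab sketch mirroring the two formulas above (the pmf and the function f are arbitrary choices for illustration, not from the slides):

    % Expectation and mode under a discrete distribution on X = {1,...,6}
    X = 1:6;
    p = [0.05 0.10 0.15 0.20 0.25 0.25];   % an arbitrary pmf, chosen for illustration
    f = X.^2;                              % some function of interest
    Ef = sum(p .* f)                       % <f(x)> = sum_x p(x) f(x)
    [best, imax] = max(p .* f);
    xstar = X(imax)                        % argmax_x p(x) f(x)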

  • Divide and Conquer

    Probabilistic modelling provides a methodology that puts a clear division between

    • What to solve : Model Construction

    – Both an Art and Science
    – Highly domain specific

    • How to solve : Inference Algorithm

    – Mechanical (in theory! not in practice)
    – Generic


  • Exercise

    p(x1, x2)   x2 = 1   x2 = 2
    x1 = 1      0.3      0.3
    x1 = 2      0.1      0.3

    1. Find the following quantities

    • Marginals: p(x1), p(x2)
    • Conditionals: p(x1|x2), p(x2|x1)
    • Posterior: p(x1, x2 = 2), p(x1|x2 = 2)
    • Evidence: p(x2 = 2)
    • p({})
    • Max: p(x∗1) = max_{x1} p(x1|x2 = 1)
    • Mode: x∗1 = argmax_{x1} p(x1|x2 = 1)
    • Max-marginal: max_{x1} p(x1, x2)

    2. Are x1 and x2 independent ? (i.e., Is p(x1, x2) = p(x1)p(x2) ?)


  • Answers

    p(x1, x2)   x2 = 1   x2 = 2
    x1 = 1      0.3      0.3
    x1 = 2      0.1      0.3

    • Marginals:

    p(x1)
    x1 = 1   0.6
    x1 = 2   0.4

    p(x2)   x2 = 1   x2 = 2
            0.4      0.6

    • Conditionals:

    p(x1|x2)   x2 = 1   x2 = 2
    x1 = 1     0.75     0.5
    x1 = 2     0.25     0.5

    p(x2|x1)   x2 = 1   x2 = 2
    x1 = 1     0.5      0.5
    x1 = 2     0.25     0.75


  • Answers

    p(x1, x2)   x2 = 1   x2 = 2
    x1 = 1      0.3      0.3
    x1 = 2      0.1      0.3

    • Posterior:

    p(x1, x2 = 2)   x2 = 2
    x1 = 1          0.3
    x1 = 2          0.3

    p(x1|x2 = 2)   x2 = 2
    x1 = 1         0.5
    x1 = 2         0.5

    • Evidence:

    p(x2 = 2) = ∑_{x1} p(x1, x2 = 2) = 0.6

    • Normalisation constant:

    p({}) = ∑_{x1} ∑_{x2} p(x1, x2) = 1


  • Answers

    p(x1, x2)   x2 = 1   x2 = 2
    x1 = 1      0.3      0.3
    x1 = 2      0.1      0.3

    • Max: (get the value)

    max_{x1} p(x1|x2 = 1) = 0.75

    • Mode: (get the index)

    argmax_{x1} p(x1|x2 = 1) = 1

    • Max-marginal: (get the “skyline”) max_{x1} p(x1, x2)

    max_{x1} p(x1, x2)   x2 = 1   x2 = 2
                         0.3      0.3

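    All of the quantities in the exercise can be read off from the joint table with a few lines of Octave/Matlab; a possible sketch (not part of the slides):

    % Joint table of the exercise: rows index x1, columns index x2
    p = [0.3 0.3; 0.1 0.3];
    p_x1 = sum(p, 2)                          % marginal p(x1) = [0.6; 0.4]
    p_x2 = sum(p, 1)                          % marginal p(x2) = [0.4 0.6]
    p_x1_given_x2 = p ./ (ones(2,1)*p_x2)     % p(x1|x2): each column sums to 1
    p_x2_given_x1 = p ./ (p_x1*ones(1,2))     % p(x2|x1): each row sums to 1
    evidence = p_x2(2)                        % p(x2 = 2) = 0.6
    [mx, ix] = max(p_x1_given_x2(:,1))        % max 0.75, attained at the mode x1 = 1
    max_marginal = max(p, [], 1)              % max_{x1} p(x1, x2) = [0.3 0.3]
    indep = isequal(p, p_x1*p_x2)             % 0: x1 and x2 are not independent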

  • Exercise: Continuous Random variables

    [Figure: the weighted component densities p(x, c = 1) and p(x, c = 2) plotted against x ∈ [−0.5, 1].]

    • Evaluate

    – p(c), p(x = 0) and p(c|x = 10)


  • Exercise: Continuous Random variables

    • x is a conditionally Gaussian random variable and c is discrete

    p(x|c) = N(x; µ(c), v(c)) ≡ (2πv(c))^(−1/2) exp( −(x − µ(c))² / (2 v(c)) )

    • In this example we take

    p(x, c = 1) = 0.6 N(x; 0, 0.01)        p(x, c = 2) = 0.4 N(x; 0.2, 0.03)


  • Solutions

    • p(x = 0) is a number, that we calculate here with Matlab (or octave)

    >> x = 0; mu = [0, 0.2]; v = [0.01 0.03]; pc = [0.6 0.4];
    >> pc .* (2*pi*v).^(-1/2) .* exp(-0.5*(x-mu).^2 ./ v)
    ans =
        2.3937    0.4730
    >> sum(ans)
    ans =
        2.8667

    • Note: this works here, but adding exp’s can be numerically unstable

    • How come the “probability” is larger than one?

    – p(x) is a density. The probability is p(x)dx


  • Solution

    [Figure: p(x, c = 1), p(x, c = 2) and the marginal p(x) plotted against x ∈ [−0.5, 1].]


  • Solutions

    • p(c|x = 10) is a distribution,

    p(c|x = 10) = p(c, x = 10)/p(x = 10)

    • We calculate here with Matlab (or octave)

    >> x = 10; mu = [0, 0.2]; v = [0.01 0.03]; pc = [0.6 0.4];
    >> pcx = pc .* (2*pi*v).^(-1/2) .* exp(-0.5*(x-mu).^2 ./ v)
    pcx =
         0     0
    >> pcx/sum(pcx)
    Warning: Divide by zero.
    ans =
       NaN   NaN

    • Problem: underflow. We ALWAYS work with log-densities in practice


  • Solutions

    [Figure: log-domain plot of p(x, c = 1) and p(x, c = 2) against x ∈ [−0.5, 1]; the vertical axis spans 10^−20 to 10^0.]


  • Solutions

    • The log-density is

    log p(x|c) = log N(x; µ(c), v(c)) ≡ −(1/2) log(2πv(c)) − (x − µ(c))² / (2 v(c))

    >> lpc = log([0.6 0.4]);
    >> lpcx = lpc - 1/2*log(2*pi*v) - 0.5*(x-mu).^2 ./ v
    lpcx =
       1.0e+003 *
       -4.9991   -1.6007
    >> lpcx - log(sum(exp(lpcx)))
    Warning: Log of zero.
    ans =
       Inf   Inf

    • The problem persists (we exponentiate very small numbers)


  • Numerically Stable Computation of log(∑_i exp(li))

    • Derivation

    L = log(∑_i exp(li))

      = log(∑_i exp(li) · exp(l∗)/exp(l∗))

      = log(exp(l∗) ∑_i exp(li − l∗))

      = l∗ + log(∑_i exp(li − l∗))

    • We take l∗ to be the maximum, l∗ = max_i li

    • Exercise: Implement the above as a function logsumexp(l) (one possible solution is sketched below)

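    One possible Octave/Matlab solution to the exercise, a straightforward transcription of the derivation above:

    function L = logsumexp(l)
    % LOGSUMEXP  Numerically stable evaluation of log(sum(exp(l))) for a vector l.
        lstar = max(l);                        % the largest term, pulled outside the log
        if isinf(lstar)                        % all entries are -Inf: the sum is zero
            L = lstar;
            return;
        end
        L = lstar + log(sum(exp(l - lstar)));  % l* + log(sum_i exp(l_i - l*))
    end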

  • Solutions

  • >> lpc = log([0.6 0.4]);
    >> lpcx = lpc - 1/2*log(2*pi*v) - 0.5*(x-mu).^2 ./ v
    lpcx =
       1.0e+003 *
       -4.9991   -1.6007
    >> lpcx - logsumexp(lpcx)
    ans =
       1.0e+003 *
       -3.3984         0
    >> exp(ans)
    ans =
         0     1


  • Probability Models + Inference Algorithms = Bayesian Numerical Methods


  • Applications of Probability Models

    • Classification

    • Optimal Decision, given a loss function

    • Finding interesting (hidden) structure

    – Clustering, Segmentation
    – Dimensionality Reduction
    – Outlier Detection

    • Finding a compact representation = Data Compression

    • Prediction


  • Graphical Models

    • formal languages for specification of probability models and associated inference algorithms

    • historically, introduced in probabilistic expert systems (Pearl 1988) as a visual guide for representing expert knowledge

    • today, a standard tool in machine learning, statistics and signal processing


  • Graphical Models

    • provide graph based algorithms for derivations and computation

    • pedagogical insight/motivation for model/algorithm construction

    – Statistics: “Kalman filter models and hidden Markov models (HMM) are equivalent up to parametrisation”

    – Signal processing: “The fast Fourier transform is an instance of the sum-product algorithm on a factor graph”

    – Computer Science: “Backtracking in Prolog is equivalent to inference in Bayesian networks with deterministic tables”

    • Automated tools for code generation are starting to emerge, making the design/implement/test cycle shorter


  • Important types of Graphical Models

    • Useful for Model Construction

    – Directed Acyclic Graphs (DAG), Bayesian Networks
    – Undirected Graphs, Markov Networks, Random Fields
    – Influence diagrams
    – ...

    • Useful for Inference

    – Factor Graphs
    – Junction/Clique graphs
    – Region graphs
    – ...


  • Directed Graphical models (DAG)


  • DAG Example: Two dice

    [Graph: a DAG with parent nodes λ and y (priors p(λ), p(y)) and child node D (conditional p(D|λ, y)).]

    p(D, λ, y) = p(D|λ, y) p(λ) p(y)


  • DAG with observations

    [Graph: the same DAG with D observed (D = 9), leaving the potential below over λ and y.]

    φD(λ, y) = p(D = 9|λ, y) p(λ) p(y)


  • Directed Graphical models

    • Each random variable is associated with a node in the graph,

    • We draw an arrow from A → B if A appears in the conditional of B, i.e. p(B| . . . , A, . . . ) (A ∈ parent(B)),

    • The edges tell us qualitatively about the factorization of the joint probability

    • For N random variables x1, . . . , xN, the distribution admits the factorization

    p(x1, . . . , xN) = ∏_{i=1}^{N} p(xi|parent(xi))

    • Describes in a compact way an algorithm to “generate” the data – “Generative models” (see the sampling sketch below)

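    Read as a generative model, a DAG prescribes ancestral sampling: draw each node given its parents, in topological order. A toy Octave/Matlab sketch for the two-dice DAG (illustrative, not from the slides):

    % Ancestral sampling from the two-dice DAG  lambda -> D <- y
    N = 100000;
    lambda = randi(6, N, 1);        % lambda ~ p(lambda)
    y      = randi(6, N, 1);        % y ~ p(y)
    D      = lambda + y;            % D is deterministic given its parents
    p_D9 = mean(D == 9)             % Monte Carlo estimate of p(D = 9); exact value 1/9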

  • Examples

    Model        Factorization
    Full         p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x1, x2, x3)
    Markov(2)    p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x2, x3)
    Markov(1)    p(x1) p(x2|x1) p(x3|x2) p(x4|x3)
    —            p(x1) p(x2|x1) p(x3|x1) p(x4)
    Factorized   p(x1) p(x2) p(x3) p(x4)

    (The graph structures over x1, . . . , x4 are omitted here.)

    Removing edges eliminates a term from the conditional probability factors.


  • Undirected Graphical Models


  • Undirected Graphical Models

    • Define a distribution by non-negative local compatibility functions φ(xα)

    p(x) = (1/Z) ∏_α φ(xα)

    where α runs over cliques: fully connected subsets

    • Examples (a 4-cycle over x1, . . . , x4 with pairwise cliques, and the same variables with two triangle cliques):

    p(x) = (1/Z) φ(x1, x2) φ(x1, x3) φ(x2, x4) φ(x3, x4)        p(x) = (1/Z) φ(x1, x2, x3) φ(x2, x3, x4)

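    For small discrete models the normalisation constant Z and any marginal can be obtained by brute-force summation over all configurations. A toy Octave/Matlab sketch for the pairwise example, with made-up potentials (the potential values are assumptions for illustration, not from the slides):

    % Pairwise undirected model on binary x1..x4 with cliques (1,2),(1,3),(2,4),(3,4)
    phi = [2 1; 1 2];                 % assumed pairwise potential favouring equal neighbours
    p = zeros(2,2,2,2);
    for x1 = 1:2, for x2 = 1:2, for x3 = 1:2, for x4 = 1:2
        p(x1,x2,x3,x4) = phi(x1,x2)*phi(x1,x3)*phi(x2,x4)*phi(x3,x4);
    end, end, end, end
    Z = sum(p(:));                    % normalisation constant
    p = p / Z;                        % joint distribution p(x1,...,x4)
    p_x1 = squeeze(sum(sum(sum(p,4),3),2))'   % marginal p(x1) by summing out x2..x4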

  • Possible Model Topologies


  • Factor graphs


  • Factor graphs [2]

    • A bipartite graph. A powerful graphical representation of the inference problem

    – Factor nodes: black squares. Factor potentials (local functions) defining the posterior.
    – Variable nodes: white nodes. Define collections of random variables.
    – Edges: denote membership. A variable node is connected to a factor node if a member variable is an argument of the local function.

    [Factor graph for the two-dice example: factor nodes p(λ), p(y) and p(D = 9|λ, y) connected to variable nodes λ and y.]

    φD(λ, y) = p(D = 9|λ, y) p(λ) p(y) = φ1(λ, y) φ2(λ) φ3(y)


  • Exercise

    • For the following graphical models, write down the factors of the joint distribution and plot an equivalent factor graph and an undirected graph.

    [Graphs: Full, Markov(1), HMM (hidden chain h1, . . . , h4 with observations x1, . . . , x4), MIX (single hidden node h), IFA (hidden nodes h1, h2) and Factorized models over x1, . . . , x4.]

  • Answer (Markov(1))

    [Directed graph: x1 → x2 → x3 → x4.]

    [Factor graph: factor nodes p(x1), p(x2|x1), p(x3|x2), p(x4|x3) attached to the variable nodes x1, x2, x3, x4.]

    [Undirected graph: the chain x1 — x2 — x3 — x4, with clique potentials]

    φ(x1, x2) = p(x1) p(x2|x1),   φ(x2, x3) = p(x3|x2),   φ(x3, x4) = p(x4|x3)


  • Answer (IFA – Factorial)

    p(h1) p(h2) ∏_{i=1}^{4} p(xi|h1, h2)

    [Directed graph: h1 and h2 are parents of each of x1, x2, x3, x4.]


  • Answer (IFA – Factorial)

    [Factor graph for the IFA model over h1, h2 and x1, . . . , x4.]

    • We can also cluster nodes together

    [The same factor graph with h1 and h2 clustered into a single variable node (h1, h2), connected to x1, . . . , x4.]


  • Inference and Learning

    • Data set: D = {x1, . . . , xN}

    • Model with parameter λ: p(D|λ)

    • Maximum Likelihood (ML)

    λML = argmax_λ log p(D|λ)

    • Predictive distribution

    p(xN+1|D) ≈ p(xN+1|λML)


  • Regularisation

    • Prior: p(λ)

    • Maximum a-posteriori (MAP): Regularised Maximum Likelihood

    λMAP = argmax_λ log p(D|λ) p(λ)

    • Predictive distribution

    p(xN+1|D) ≈ p(xN+1|λMAP)


  • Bayesian Learning

    • We treat parameters on the same footing as all other variables

    • We integrate over unknown parameters rather than using point estimates (remember the many-dice example)

    – Self-regularisation, avoids overfitting
    – Natural setup for online adaptation
    – Model selection


  • Bayesian Learning

    • Predictive distribution

    p(xN+1|D) = ∫ dλ p(xN+1|λ) p(λ|D)

    [Graph: the parameter λ is a common parent of x1, x2, . . . , xN and xN+1.]

    • Bayesian learning is just inference ... (a small sketch contrasting ML, MAP and the Bayesian predictive follows below)

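    As a concrete (if very small) illustration of the three estimates, consider a Bernoulli model with a Beta prior, where everything is available in closed form. The data and hyperparameters below are invented for illustration and are not from the slides:

    % ML, MAP and fully Bayesian predictive for a Bernoulli model with a Beta(a,b) prior
    D = [1 1 1 0 1];                              % observed binary data (invented)
    N = length(D); k = sum(D);
    a = 2; b = 2;                                 % Beta prior hyperparameters (assumed)
    lambda_ML  = k / N                            % argmax_lambda log p(D|lambda)
    lambda_MAP = (k + a - 1) / (N + a + b - 2)    % argmax_lambda log p(D|lambda) p(lambda)
    % Bayesian: integrate lambda out; the posterior is Beta(k + a, N - k + b)
    p_next = (k + a) / (N + a + b)                % predictive p(x_{N+1} = 1 | D)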

  • Example Applications and Models


  • Audio Restoration

    • During download or transmission, some samples of audio are lost

    • Estimate missing samples given clean ones

    [Figure: an audio waveform (samples 0–500) with missing segments.]


  • Examples: Audio Restoration

    p(x¬κ|xκ) ∝ ∫ dH p(x¬κ|H) p(xκ|H) p(H)

    H ≡ (parameters, hidden states)

    [Graph: H is a common parent of x¬κ (missing) and xκ (observed).]

    [Figure: the audio waveform with the missing part x¬κ and the observed part xκ indicated.]


  • Restoration

    (Cemgil and Godsill 2005 [?])

    • Piano

    – Signal with missing samples (37%)
    – Reconstruction, 7.68 dB improvement
    – Original

    • Trumpet

    – Signal with missing samples (37%)
    – Reconstruction, 7.10 dB improvement
    – Original


    [Embedded audio: piano_missing.wav, piano_kalman.wav, piano_clean.wav, trumpet_missing.wav, trumpet_kalman.wav, trumpet_clean.wav]

  • Interpolation of Images

    [Graph: H is a common parent of x¬κ (missing pixels) and xκ (observed pixels).]


  • Interpolation of Images

    [Figure panels: Data (25% Missing), Variational Bayes+ICM NMF, ML NMF2.]


  • Medical Expert Systems

    [Graph: the “Asia” network. Causes: A (visit to Asia), S (smoking). Diseases: T (tuberculosis), L (lung cancer), B (bronchitis), E (either T or L). Symptoms: X (positive X-ray), D (dyspnoea).]


  • Medical Expert Systems

    Visit to Asia?   Smoking?

    Tuberculosis?   Lung Cancer?   Bronchitis?

    Either T or L?

    Positive X-Ray?   Dyspnoea?


  • Medical Expert Systems

    Marginal probabilities with no evidence entered:

    Node               State 0   State 1
    Visit to Asia?     99 %       1 %
    Smoking?           50 %      50 %
    Tuberculosis?      99 %       1 %
    Lung Cancer?       94.5 %     5.5 %
    Bronchitis?        55 %      45 %
    Either T or L?     93.5 %     6.5 %
    Positive X-Ray?    89 %      11 %
    Dyspnoea?          56.4 %    43.6 %


  • Medical Expert Systems

    Marginal probabilities after observing a positive X-ray:

    Node               State 0   State 1
    Visit to Asia?     98.7 %     1.3 %
    Smoking?           31.2 %    68.8 %
    Tuberculosis?      90.8 %     9.2 %
    Lung Cancer?       51.1 %    48.9 %
    Bronchitis?        49.4 %    50.6 %
    Either T or L?     42.4 %    57.6 %
    Positive X-Ray?     0 %     100 %
    Dyspnoea?          35.9 %    64.1 %


  • Medical Expert Systems

    Marginal probabilities after observing a positive X-ray and Smoking = 0:

    Node               State 0   State 1
    Visit to Asia?     98.5 %     1.5 %
    Smoking?          100 %       0 %
    Tuberculosis?      85.2 %    14.8 %
    Lung Cancer?       85.8 %    14.2 %
    Bronchitis?        70 %      30 %
    Either T or L?     71.1 %    28.9 %
    Positive X-Ray?     0 %     100 %
    Dyspnoea?          56 %      44 %


  • Model Selection: Variable selection in Polynomial Regression

    • Given D = {tj, x(tj)}j=1...J , what is the order N of the polynomial?

    x(t) = s1 + s2 t + s3 t² + s4 t³ + · · · + ε(t)

    [Figure: noisy observations (tj, x(tj)) for t ∈ [−1, 1], to be fitted by a polynomial.]


  • Bayesian Variable Selection

    [Graph: indicators r1, . . . , rW with priors C(rw; π); coefficients s1, . . . , sW with conditionals N(sw; µ(rw), Σ(rw)); observation x with likelihood N(x; C s1:W, R).]

    • Generalized Linear Model – the columns of C are the basis vectors

    • The exact posterior is a mixture of 2^W Gaussians

    • When W is large, computation of posterior features becomes intractable.


  • Regression

    [Figure: the joint probability p(x, r1:W) evaluated for each of the 2^W on/off configurations, ordered from “all on” to “all off”.]


  • Regression

    [Figure: the data, the true function and the approximate posterior fit, plotted over t ∈ [−1, 1].]


  • Clustering


  • Clustering

    π               Label probability
    c1 c2 . . . cN  Labels ∈ {a, b}
    x1 x2 . . . xN  Data Points
    µa, µb          Cluster Centers

    (µ∗a, µ∗b, π∗) = argmax_{µa,µb,π} ∑_{c1:N} ∏_{i=1}^{N} p(xi|µa, µb, ci) p(ci|π)

    (This maximisation is typically carried out with the EM algorithm; a toy sketch follows below.)

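    A toy Octave/Matlab sketch of this maximisation via EM, for one-dimensional data and unit observation variance (the data, initialisation and simplifications are assumptions for illustration, not part of the slides):

    % EM for two cluster centres mu_a, mu_b and label probability pi (1-D, unit variance)
    x = [randn(1,50) - 2, randn(1,50) + 2];   % synthetic data from two clusters
    mu = [-1 1]; ppi = 0.5;                   % initial guesses
    for it = 1:50
        % E-step: responsibility of cluster a for each data point
        wa = ppi     * exp(-0.5*(x - mu(1)).^2);
        wb = (1-ppi) * exp(-0.5*(x - mu(2)).^2);
        r  = wa ./ (wa + wb);
        % M-step: re-estimate the centres and the label probability
        mu(1) = sum(r .* x) / sum(r);
        mu(2) = sum((1-r) .* x) / sum(1-r);
        ppi   = mean(r);
    end
    mu, ppi                                   % estimated centres and label probability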

  • Computer vision / Cognitive Science

    How many rectangles are there in this image?

    [Image: a scene containing rectangles to be counted.]


  • Computer Vision

    How many people are there in these images?


  • Visual Tracking

    [Figure: four video frames from a tracking sequence.]


  • Navigation, Robotics : Sensor Fusion

    [Figures: 2-D and 3-D plots of sensor readings and estimated trajectories (axes f, Lx, Ly) for the sensor-fusion example.]


  • Navigation, Robotics : Sensor Fusion

    GPS?t GPS status

    Gt GPS reading

    ... Other sensors (magnetic, pressure, etc.)

    lt Linear acceleration sensor

    ωt Gyroscope

    Et−1 Et Attitude Variables

    Xt−1 Xt Linear Kinematic Variables

    {ξ1:Nt}t Set of feature points (Camera Frame)

    {x1:Mt}t Set of feature points (World Coordinates)

    ρ(x) Global Static Map (Intensity function)


  • Computer Accompaniment

    (Music Plus One, Raphael 2000 [?], Dannenberg and Raphael 2006)

    [Graph: chains c0, c1, . . . , cK and s0, s1, . . . , sK, with solo observations y1, . . . , yK, accompaniment observations y^a_1, . . . , y^a_K, and accompaniment variables a0, a1, . . . , aK.]


  • References

    [1] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, edited by G. L. Bretthorst, 2003.

    [2] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, February 2001.

    [3] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
