Page 1: CS 59000 Statistical Machine learning Lecture 24

CS 59000 Statistical Machine learning, Lecture 24

Yuan (Alan) Qi, Purdue CS

Nov. 20, 2008

Page 2: CS 59000 Statistical Machine learning Lecture 24

Outline

• Review of K-medoids, Mixture of Gaussians, Expectation Maximization (EM), Alternative view of EM

• Hidden Markov Models, forward-backward algorithm, EM for learning HMM parameters, Viterbi algorithm, linear state space models, Kalman filtering and smoothing

Page 3: CS 59000 Statistical Machine learning Lecture 24

K-medoids Algorithm
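The algorithm itself appears as a figure in the original slides. For reference, a standard formulation (following PRML Section 9.1; the notation here is my assumption) minimizes the distortion

\[ \tilde{J} = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, V(x_n, \mu_k), \]

where V is a general dissimilarity measure, by alternating two steps: assign each point to its closest prototype, then restrict each prototype \mu_k to be one of the data points assigned to cluster k (the medoid) that minimizes the within-cluster dissimilarity.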

Page 4: CS 59000 Statistical Machine learning Lecture 24

Mixture of Gaussians

Mixture of Gaussians:

Introduce latent variables:

Marginal distribution:
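The three equations on this slide are images in the transcript; their standard forms (PRML-style notation, assumed here) are

\[ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \]

with a 1-of-K binary latent variable z such that

\[ p(z) = \prod_{k=1}^{K} \pi_k^{z_k}, \qquad p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k), \]

so that the marginal distribution is recovered as

\[ p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k). \]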

Page 5: CS 59000 Statistical Machine learning Lecture 24

Conditional Probability

Responsibility that component k takes for explaining the observation.
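The defining formula is an image in the original; in standard notation,

\[ \gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}. \]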

Page 6: CS 59000 Statistical Machine learning Lecture 24

Maximum Likelihood

Maximize the log likelihood function
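For a data set X = {x_1, …, x_N}, the standard form of this objective is

\[ \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}. \]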

Page 7: CS 59000 Statistical Machine learning Lecture 24

Severe Overfitting by Maximum Likelihood

When a cluster has only one data point, its variance goes to 0 and the log likelihood diverges.

Page 8: CS 59000 Statistical Machine learning Lecture 24

Maximum Likelihood Conditions (1)

Setting the derivatives of the log likelihood with respect to the means μ_k to zero:
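The resulting condition is an image in the transcript; the standard form is

\[ \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}). \]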

Page 9: CS 59000 Statistical Machine learning Lecture 24

Maximum Likelihood Conditions (2)

Setting the derivative of the log likelihood with respect to the covariances Σ_k to zero:
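Likewise, the standard result for the covariances is

\[ \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{\mathsf{T}}. \]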

Page 10: CS 59000 Statistical Machine learning Lecture 24

Maximum Likelihood Conditions (3)

Lagrange function:

Setting its derivative to zero and using the normalization constraint, we obtain:
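Both the Lagrange function and the resulting condition appear as images in the slides; their standard forms are

\[ \ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right) \qquad \Longrightarrow \qquad \pi_k = \frac{N_k}{N}. \]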

Page 11: CS 59000 Statistical Machine learning Lecture 24

Expectation Maximization for Mixture Gaussians

Although the previous conditions do not provide a closed-form solution, we can use them to construct iterative updates:

E step: Compute the responsibilities γ(z_nk).

M step: Compute new means μ_k, covariances Σ_k, and mixing coefficients π_k.

Loop over the E and M steps until the log likelihood stops increasing.
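These updates fit in a few lines of NumPy. The sketch below is illustrative only, not the lecture's code; the function name em_gmm, the initialization scheme, and the small regularization term are my own assumptions.

```python
# A minimal EM-for-Gaussian-mixtures sketch (illustrative, not the lecture's code).
# Assumes: X is an (N, D) data array; K is the number of components.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Initialization: random data points as means, shared covariance, uniform weights.
    means = X[rng.choice(N, K, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)

    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k] = p(z_k = 1 | x_n).
        dens = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
            for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: re-estimate means, covariances, and mixing coefficients.
        Nk = gamma.sum(axis=0)                      # effective counts per component
        means = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            covs[k] += 1e-6 * np.eye(D)             # guard against the singularity problem
        pis = Nk / N

    return pis, means, covs, gamma
```

In practice one would also monitor the log likelihood and stop when it no longer increases, rather than running a fixed number of iterations.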

Page 12: CS 59000 Statistical Machine learning Lecture 24

General EM Algorithm
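The boxed algorithm on this slide appears as a figure in the original. For reference, the standard general EM procedure (in PRML-style notation, assumed here) is:

1. Initialize the parameters θ^old.
2. E step: evaluate p(Z | X, θ^old).
3. M step: evaluate

\[ \theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}}), \qquad Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta). \]

4. Check for convergence; otherwise set θ^old ← θ^new and return to step 2.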

Page 13: CS 59000 Statistical Machine learning Lecture 24

EM and Jensen Inequality

Goal: maximize the log likelihood ln p(X|θ).

Define: a functional L(q, θ) of a distribution q(Z) over the latent variables (see below).

From Jensen's inequality, we see that L(q, θ) is a lower bound of ln p(X|θ).
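Filling in the missing equations with their standard forms (an assumption on my part):

\[ \mathcal{L}(q, \theta) = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}, \]

\[ \ln p(X \mid \theta) = \ln \sum_{Z} q(Z) \frac{p(X, Z \mid \theta)}{q(Z)} \;\ge\; \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)} = \mathcal{L}(q, \theta), \]

where the inequality uses the concavity of the logarithm, which is what Jensen's inequality gives here.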

Page 14: CS 59000 Statistical Machine learning Lecture 24

Lower Bound

L(q, θ) is a functional of the distribution q(Z).

Since ln p(X|θ) = L(q, θ) + KL(q ‖ p) and KL(q ‖ p) ≥ 0, L(q, θ) is a lower bound of the log likelihood function ln p(X|θ). (This is another way to see the lower bound without using Jensen's inequality.)
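For completeness, the KL term in this decomposition is (standard form, assumed)

\[ \mathrm{KL}(q \,\|\, p) = -\sum_{Z} q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)} \;\ge\; 0, \]

which vanishes exactly when q(Z) = p(Z | X, θ).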

Page 15: CS 59000 Statistical Machine learning Lecture 24

Lower Bound Perspective of EM

• Expectation Step: Maximize the functional lower bound L(q, θ^old) over the distribution q(Z), which gives q(Z) = p(Z | X, θ^old).

• Maximization Step: Maximize the lower bound L(q, θ) over the parameters θ.

Page 16: CS 59000 Statistical Machine learning Lecture 24

Illustration of EM Updates

Page 17: CS 59000 Statistical Machine learning Lecture 24

Sequential Data

There are temporal dependences between data points.

Page 18: CS 59000 Statistical Machine learning Lecture 24

Markov Models

By the chain rule, a joint distribution can be rewritten as:

Assuming conditional independence, we have:

This is known as a first-order Markov chain.
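The two factorizations referred to above, in their standard forms, are

\[ p(x_1, \ldots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \ldots, x_{n-1}) \]

(chain rule) and

\[ p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1}) \]

(first-order Markov chain).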

Page 19: CS 59000 Statistical Machine learning Lecture 24

High Order Markov Chains

Second order Markov assumption

This can be generalized to higher-order Markov chains, but the number of parameters grows exponentially with the order.
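For concreteness (standard results, not taken from the slide images): a second-order chain factorizes as

\[ p(x_1, \ldots, x_N) = p(x_1)\, p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2}), \]

and an M-th order chain over K discrete states requires K^M (K − 1) free parameters, which is the exponential growth mentioned above.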

Page 20: CS 59000 Statistical Machine learning Lecture 24

State Space Models

Important graphical models for many dynamic processes, including Hidden Markov Models (HMMs) and linear dynamical systems.

Question: what order should we choose for the Markov assumption?

Page 21: CS 59000 Statistical Machine learning Lecture 24

Hidden Markov Models

Many applications, e.g., speech recognition, natural language processing, handwriting recognition, bio-sequence analysis

Page 22: CS 59000 Statistical Machine learning Lecture 24

From Mixture Models to HMMs

By turning a mixture model into a dynamic model, we obtain the HMM.

We model the dependence between two consecutive latent variables by a transition probability:
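In standard notation (assumed here), with z_n a 1-of-K binary latent variable, the transition probabilities form a K × K matrix A:

\[ A_{jk} = p(z_{nk} = 1 \mid z_{n-1,j} = 1), \qquad \sum_{k} A_{jk} = 1, \qquad p(z_n \mid z_{n-1}, A) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{\, z_{n-1,j}\, z_{nk}}. \]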

Page 23: CS 59000 Statistical Machine learning Lecture 24

HMMs

Prior on initial latent variable:

Emission probabilities:

Joint distribution:
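These three quantities are images in the transcript; their standard forms are

\[ p(z_1 \mid \pi) = \prod_{k=1}^{K} \pi_k^{\, z_{1k}}, \qquad p(x_n \mid z_n, \phi) = \prod_{k=1}^{K} p(x_n \mid \phi_k)^{\, z_{nk}}, \]

\[ p(X, Z \mid \theta) = p(z_1 \mid \pi) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \right] \prod_{n=1}^{N} p(x_n \mid z_n, \phi), \qquad \theta = \{\pi, A, \phi\}. \]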

Page 24: CS 59000 Statistical Machine learning Lecture 24

Samples from HMM

(a) Contours of constant probability density for the emission distributions corresponding to each of the three states of the latent variable. (b) A sample of 50 points drawn from the hidden Markov model, with lines connecting the successive observations.

Page 25: CS 59000 Statistical Machine learning Lecture 24

Inference: Forward-backward Algorithm

Goal: compute the marginals for the latent variables.

Forward-backward algorithm: exact inference as a special case of the sum-product algorithm on the HMM.

Factor graph representation (grouping the emission density and the transition probability into one factor at a time):
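With this grouping, the factors are (standard construction, assumed)

\[ h(z_1) = p(z_1)\, p(x_1 \mid z_1), \qquad f_n(z_{n-1}, z_n) = p(z_n \mid z_{n-1})\, p(x_n \mid z_n). \]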

Page 26: CS 59000 Statistical Machine learning Lecture 24

Forward-backward Algorithm as Message Passing Method (1)

Forward messages:
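The forward recursion, written in its standard form, is

\[ \alpha(z_n) \equiv p(x_1, \ldots, x_n, z_n), \qquad \alpha(z_1) = p(z_1)\, p(x_1 \mid z_1), \]

\[ \alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1})\, p(z_n \mid z_{n-1}). \]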

Page 27: CS 59000 Statistical Machine learning Lecture 24

Forward-backward Algorithm as Message Passing Method (2)

Backward messages (Q: how do we compute them?):

The messages actually involve the observations X.

Similarly, we can compute the following (Q: why?):
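The corresponding standard equations (reconstructed here, since the slide versions are images) are

\[ \beta(z_n) \equiv p(x_{n+1}, \ldots, x_N \mid z_n), \qquad \beta(z_N) = 1, \]

\[ \beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n), \]

\[ \gamma(z_n) = \frac{\alpha(z_n)\, \beta(z_n)}{p(X)}, \qquad \xi(z_{n-1}, z_n) = \frac{\alpha(z_{n-1})\, p(x_n \mid z_n)\, p(z_n \mid z_{n-1})\, \beta(z_n)}{p(X)}, \qquad p(X) = \sum_{z_n} \alpha(z_n)\, \beta(z_n). \]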

Page 28: CS 59000 Statistical Machine learning Lecture 24

Rescaling to Avoid Numerical Underflow

When a sequence is long, the forward messages become too small to be represented within the dynamic range of the computer. We therefore redefine the forward message in rescaled form. Similarly, we redefine the backward message in rescaled form. Then we can compute the required quantities from the rescaled messages.

See the detailed derivation in the textbook.
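One standard way to write the rescaled messages (my reconstruction of the missing equations) is

\[ \hat{\alpha}(z_n) = p(z_n \mid x_1, \ldots, x_n) = \frac{\alpha(z_n)}{\prod_{m=1}^{n} c_m}, \qquad \hat{\beta}(z_n) = \frac{\beta(z_n)}{\prod_{m=n+1}^{N} c_m}, \qquad c_n = p(x_n \mid x_1, \ldots, x_{n-1}), \]

so that

\[ \gamma(z_n) = \hat{\alpha}(z_n)\, \hat{\beta}(z_n), \qquad \xi(z_{n-1}, z_n) = \frac{1}{c_n}\, \hat{\alpha}(z_{n-1})\, p(x_n \mid z_n)\, p(z_n \mid z_{n-1})\, \hat{\beta}(z_n). \]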

Page 29: CS 59000 Statistical Machine learning Lecture 24

Viterbi Algorithm

Viterbi Algorithm:

• Finding the most probable sequence of states

• Special case of the max-sum algorithm on the HMM

What if we want to find the most probable individual states instead?
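As a concrete illustration, here is a minimal log-space Viterbi sketch; the array names pi, A, B and the function name viterbi are assumptions for this example, not the lecture's notation.

```python
# A minimal Viterbi sketch in log space (illustrative only).
# pi[k]  : initial state probabilities, shape (K,)
# A[j,k] : transition probability from state j to state k, shape (K, K)
# B[k,n] : emission probability of the n-th observation under state k, shape (K, N)
import numpy as np

def viterbi(pi, A, B):
    K, N = B.shape
    logA, logB = np.log(A), np.log(B)
    omega = np.log(pi) + logB[:, 0]          # omega[k]: best log prob of a path ending in state k
    backptr = np.zeros((N, K), dtype=int)

    for n in range(1, N):
        scores = omega[:, None] + logA       # scores[j, k]: come from state j, move to state k
        backptr[n] = scores.argmax(axis=0)   # best predecessor for each state k
        omega = scores.max(axis=0) + logB[:, n]

    # Backtrack the most probable state sequence.
    states = np.zeros(N, dtype=int)
    states[-1] = omega.argmax()
    for n in range(N - 1, 0, -1):
        states[n - 1] = backptr[n, states[n]]
    return states, omega.max()
```

Working in log space sidesteps the numerical underflow issue addressed by rescaling on the previous slide.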

Page 30: CS 59000 Statistical Machine learning Lecture 24

Maximum Likelihood Estimation for HMM

Goal: maximize
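The objective (not reproduced in the transcript) is the marginal likelihood of the observed sequence,

\[ p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta), \qquad \theta = \{\pi, A, \phi\}. \]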

Looks familiar? Remember EM for mixture of Gaussians… Indeed the updates are similar.

Page 31: CS 59000 Statistical Machine learning Lecture 24

EM for HMM

E step:
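In the usual notation (assumed here), the E-step quantities are

\[ \gamma(z_n) = p(z_n \mid X, \theta^{\text{old}}), \qquad \xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X, \theta^{\text{old}}). \]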

Computed using the forward-backward (sum-product) algorithm.

M step:
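The re-estimation equations are images in the original; the standard results are

\[ \pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}, \qquad A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j},\, z_{nk})}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j},\, z_{nl})}, \]

with the emission parameters re-estimated from the γ(z_{nk})-weighted data, e.g. for Gaussian emissions

\[ \mu_k = \frac{\sum_{n} \gamma(z_{nk})\, x_n}{\sum_{n} \gamma(z_{nk})}, \qquad \Sigma_k = \frac{\sum_{n} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{\mathsf{T}}}{\sum_{n} \gamma(z_{nk})}. \]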

Page 32: CS 59000 Statistical Machine learning Lecture 24

Linear Dynamical Systems

Equivalently, we have

where
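A standard way to write the model referred to here (my reconstruction, following the usual linear-Gaussian formulation) is

\[ z_n = A z_{n-1} + w_n, \qquad x_n = C z_n + v_n, \qquad z_1 = \mu_0 + u, \]

with Gaussian noise terms

\[ w_n \sim \mathcal{N}(0, \Gamma), \qquad v_n \sim \mathcal{N}(0, \Sigma), \qquad u \sim \mathcal{N}(0, V_0), \]

or equivalently

\[ p(z_n \mid z_{n-1}) = \mathcal{N}(z_n \mid A z_{n-1}, \Gamma), \qquad p(x_n \mid z_n) = \mathcal{N}(x_n \mid C z_n, \Sigma), \qquad p(z_1) = \mathcal{N}(z_1 \mid \mu_0, V_0). \]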

Page 33: CS 59000 Statistical Machine learning Lecture 24

Kalman Filtering and Smoothing

Inference on linear Gaussian systems.

Kalman filtering: sequentially update the scaled forward messages.

Kalman smoothing: sequentially update the state beliefs based on the scaled forward and backward messages.
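As an illustration of the filtering recursion, here is a minimal NumPy sketch of a Kalman filter. The matrix names (A, C, Gamma, Sigma) and the function name kalman_filter are assumptions for this example; they follow the standard linear-Gaussian model rather than the slide's (missing) equations.

```python
# A minimal Kalman filter sketch (illustrative only).
# Model: z_n = A z_{n-1} + w_n, w_n ~ N(0, Gamma);  x_n = C z_n + v_n, v_n ~ N(0, Sigma).
import numpy as np

def kalman_filter(xs, A, C, Gamma, Sigma, mu0, V0):
    """Return the filtered means and covariances of p(z_n | x_1..x_n) = N(mu_n, V_n)."""
    mus, Vs = [], []
    mu, V = mu0, V0
    for n, x in enumerate(xs):
        if n == 0:
            mu_pred, P = mu0, V0        # first step uses the prior directly
        else:
            mu_pred = A @ mu            # predict the next latent state
            P = A @ V @ A.T + Gamma     # predicted covariance
        S = C @ P @ C.T + Sigma         # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)  # Kalman gain
        mu = mu_pred + K @ (x - C @ mu_pred)   # measurement update with observation x
        V = (np.eye(len(mu)) - K @ C) @ P
        mus.append(mu)
        Vs.append(V)
    return mus, Vs
```

Kalman smoothing would add a backward pass over these filtered estimates to incorporate future observations.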

Page 34: CS 59000 Statistical Machine learning Lecture 24

Learning in LDS

EM again…

Page 35: CS 59000 Statistical Machine learning Lecture 24

Extension of HMM and LDS

Discrete latent variables: factorial HMMs

Continuous latent variables: switching Kalman filter models
