Transcript of A Tutorial on Hidden Markov Models.pdf
8/17/2019 A Tutorial on Hidden Markov Models.pdf

Introduction Forward-Backward Procedure Viterbi Algorithm Baum-Welch Reestimation Extensions

A Tutorial on Hidden Markov Models by Lawrence R. Rabiner
in Readings in Speech Recognition (1990)

Marcin Marszałek
Visual Geometry Group
16 February 2009

Figure: Andrey Markov
Signals and signal models

Real-world processes produce signals, i.e., observable outputs:
- discrete (from a codebook) vs continuous
- stationary (with constant statistical properties) vs nonstationary
- pure vs corrupted (by noise)

Signal models provide a basis for:
- signal analysis, e.g., simulation
- signal processing, e.g., noise removal
- signal recognition, e.g., identification

Signal models can be:
- deterministic – exploit some known properties of a signal
- statistical – characterize statistical properties of a signal

Statistical signal models:
- Gaussian processes
- Poisson processes
- Markov processes
- Hidden Markov processes

Assumption: The signal can be well characterized as a parametric random process, and the parameters of the stochastic process can be determined in a precise, well-defined manner.
Discrete (observable) Markov model
Figure: A Markov chain with 5 states and selected transitions
N states: S_1, S_2, ..., S_N

In each time instant t = 1, 2, ..., T the system changes (makes a transition) to state q_t
Discrete (observable) Markov model
For the special case of a first-order Markov chain

    P(q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, ...) = P(q_t = S_j | q_{t-1} = S_i)

Furthermore, we only consider processes where the right-hand side is time-independent, i.e., constant state transition probabilities

    a_ij = P(q_t = S_j | q_{t-1} = S_i)    1 ≤ i, j ≤ N

where

    a_ij ≥ 0    and    Σ_{j=1}^{N} a_ij = 1
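These constraints can be checked numerically, and a state sequence sampled from the chain. A minimal Python/NumPy sketch, where the 3-state transition matrix is an invented example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 3-state transition matrix: A[i, j] = P(q_t = S_j | q_{t-1} = S_i)
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])

# Each row must be a probability distribution: a_ij >= 0, sum_j a_ij = 1
assert np.all(A >= 0) and np.allclose(A.sum(axis=1), 1.0)

def sample_chain(A, T, q1=0):
    """Sample a state sequence q_1 ... q_T from a first-order Markov chain."""
    states = [q1]
    for _ in range(T - 1):
        # Next state drawn from the transition distribution of the current state
        states.append(rng.choice(len(A), p=A[states[-1]]))
    return states

Q = sample_chain(A, T=10)
```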
Discrete hidden Markov model (DHMM)
Figure: Discrete HMM with 3 states and 4 possible outputs
An observation is a probabilistic function of a state, i.e., an HMM is a doubly embedded stochastic process

A DHMM is characterized by:
- N states S_j and M distinct observations v_k (alphabet size)
- State transition probability distribution A
- Observation symbol probability distribution B
- Initial state distribution π
Discrete hidden Markov model (DHMM)
We define the DHMM as λ = (A, B, π):

    A = {a_ij}    a_ij = P(q_{t+1} = S_j | q_t = S_i)    1 ≤ i, j ≤ N
    B = {b_ik}    b_ik = P(O_t = v_k | q_t = S_i)        1 ≤ i ≤ N, 1 ≤ k ≤ M
    π = {π_i}     π_i = P(q_1 = S_i)                     1 ≤ i ≤ N

This allows us to generate an observation sequence O = O_1 O_2 ... O_T:
1. Set t = 1; choose an initial state q_1 = S_i according to the initial state distribution π
2. Choose O_t = v_k according to the symbol probability distribution in state S_i, i.e., b_ik
3. Transit to a new state q_{t+1} = S_j according to the state transition probability distribution for state S_i, i.e., a_ij
4. Set t = t + 1; if t ≤ T, return to step 2
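The generation steps above can be sketched directly; the 2-state, 3-symbol model λ = (A, B, π) below is a made-up illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented DHMM lambda = (A, B, pi): N = 2 states, M = 3 symbols
A  = np.array([[0.8, 0.2],
               [0.4, 0.6]])          # a_ij = P(q_{t+1} = S_j | q_t = S_i)
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])     # b_ik = P(O_t = v_k | q_t = S_i)
pi = np.array([0.6, 0.4])            # pi_i = P(q_1 = S_i)

def generate(A, B, pi, T):
    """Generate an observation sequence O_1 ... O_T by the four steps above."""
    q = rng.choice(len(pi), p=pi)                     # step 1: initial state from pi
    O = []
    for t in range(T):
        O.append(rng.choice(B.shape[1], p=B[q]))      # step 2: emit a symbol via b_ik
        q = rng.choice(len(A), p=A[q])                # step 3: transition via a_ij
    return O                                          # step 4: loop until t reaches T

O = generate(A, B, pi, T=8)
```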
Three basic problems for HMMs
Evaluation: Given the observation sequence O = O_1 O_2 ... O_T and a model λ = (A, B, π), how do we efficiently compute P(O|λ), i.e., the probability of the observation sequence given the model?

Recognition: Given the observation sequence O = O_1 O_2 ... O_T and a model λ = (A, B, π), how do we choose a corresponding state sequence Q = q_1 q_2 ... q_T which is optimal in some sense, i.e., best explains the observations?

Training: Given the observation sequence O = O_1 O_2 ... O_T, how do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?
Brute force solution to the evaluation problem
We need P(O|λ), i.e., the probability of the observation sequence O = O_1 O_2 ... O_T given the model λ

So we can enumerate every possible state sequence Q = q_1 q_2 ... q_T

For a sample sequence Q

    P(O|Q, λ) = Π_{t=1}^{T} P(O_t | q_t, λ) = Π_{t=1}^{T} b_{q_t O_t}

The probability of such a state sequence Q is

    P(Q|λ) = P(q_1) Π_{t=2}^{T} P(q_t | q_{t-1}) = π_{q_1} Π_{t=2}^{T} a_{q_{t-1} q_t}
Brute force solution to the evaluation problem
Therefore the joint probability is

    P(O, Q|λ) = P(Q|λ) P(O|Q, λ) = π_{q_1} Π_{t=2}^{T} a_{q_{t-1} q_t} Π_{t=1}^{T} b_{q_t O_t}

By considering all possible state sequences

    P(O|λ) = Σ_Q π_{q_1} b_{q_1 O_1} Π_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t O_t}

Problem: on the order of 2T · N^T calculations
- N^T possible state sequences
- about 2T calculations for each sequence
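A direct implementation of this enumeration shows why it is only feasible for tiny models; the model parameters and the observation sequence below are invented toy values:

```python
import itertools
import numpy as np

# Invented toy DHMM, small enough to enumerate all N^T state sequences
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
O  = [0, 2, 1]  # made-up observation sequence, T = 3

def brute_force_likelihood(A, B, pi, O):
    """P(O|lambda) by summing pi_{q1} b_{q1 O1} * prod of a and b terms over all Q."""
    N, T = len(pi), len(O)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):   # all N^T state sequences
        p = pi[Q[0]] * B[Q[0], O[0]]
        for t in range(1, T):
            p *= A[Q[t-1], Q[t]] * B[Q[t], O[t]]
        total += p
    return total

likelihood = brute_force_likelihood(A, B, pi, O)
```

As a sanity check, summing this quantity over every possible observation sequence of length T gives exactly 1.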
Forward procedure
Figure: Operations for computing the forward variable α_j(t+1)

Figure: Computing α_j(t) in terms of a lattice
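The forward variable α_j(t) = P(O_1 O_2 ... O_t, q_t = S_j | λ) shown in the figures is computed inductively by the standard recursion α_j(1) = π_j b_{j O_1} and α_j(t+1) = [Σ_i α_i(t) a_ij] b_{j O_{t+1}}, with P(O|λ) = Σ_j α_j(T), which costs on the order of N²T operations instead of 2T N^T. A minimal sketch (the toy model is invented):

```python
import numpy as np

# Invented toy DHMM
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
O  = [0, 2, 1]

def forward(A, B, pi, O):
    """alpha[t, j] holds alpha_j(t+1) with 0-based t."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                    # alpha_j(1) = pi_j b_{j O_1}
    for t in range(1, T):
        # sum_i alpha_i(t) a_ij, then multiply by the emission probability
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]
    return alpha

alpha = forward(A, B, pi, O)
likelihood = alpha[-1].sum()                      # P(O|lambda) = sum_j alpha_j(T)
```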
Backward procedure
Figure: Operations for computing the backward variable β_i(t)

We define a backward variable β_i(t) as the probability of the partial observation sequence after time t, given state S_i at time t

    β_i(t) = P(O_{t+1} O_{t+2} ... O_T | q_t = S_i, λ)

This can be computed inductively as well

    β_i(T) = 1    1 ≤ i ≤ N

    β_i(t-1) = Σ_{j=1}^{N} a_ij b_{j O_t} β_j(t)    2 ≤ t ≤ T
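A sketch of this backward recursion, using the same invented toy model as the other examples here; note that P(O|λ) = Σ_i π_i b_{i O_1} β_i(1) provides a cross-check against the forward pass:

```python
import numpy as np

# Same invented toy DHMM as in the other sketches
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
O  = [0, 2, 1]

def backward(A, B, O):
    """beta[t, i] holds beta_i(t+1) with 0-based t."""
    T, N = len(O), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                # beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        # beta_i(t) = sum_j a_ij b_{j O_{t+1}} beta_j(t+1)
        beta[t] = A @ (B[:, O[t+1]] * beta[t+1])
    return beta

beta = backward(A, B, O)
# P(O|lambda) recovered from the backward pass
likelihood = (pi * B[:, O[0]] * beta[0]).sum()
```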
Uncovering the hidden state sequence
Unlike for evaluation, there is no single "optimal" sequence; we can either:
- choose states which are individually most likely (maximizes the number of correct states), or
- find the single best state sequence (guarantees that the uncovered sequence is valid)

The first choice means finding argmax_i γ_i(t) for each t, where

    γ_i(t) = P(q_t = S_i | O, λ)

In terms of the forward and backward variables

    γ_i(t) = P(O_1 ... O_t, q_t = S_i | λ) P(O_{t+1} ... O_T | q_t = S_i, λ) / P(O|λ)

    γ_i(t) = α_i(t) β_i(t) / Σ_{j=1}^{N} α_j(t) β_j(t)
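Combining the two passes gives γ_i(t) and the individually most likely state at each time step; a sketch with the same invented toy model:

```python
import numpy as np

# Invented toy DHMM; alpha and beta come from the forward/backward recursions
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
O  = [0, 2, 1]
T, N = len(O), len(pi)

alpha = np.zeros((T, N))
alpha[0] = pi * B[:, O[0]]
for t in range(1, T):
    alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]

beta = np.ones((T, N))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, O[t+1]] * beta[t+1])

# gamma_i(t) = alpha_i(t) beta_i(t) / sum_j alpha_j(t) beta_j(t)
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)

# Individually most likely state at each t
best_states = gamma.argmax(axis=1)
```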
Viterbi algorithm
Finding the best single sequence means computing argmax_Q P(Q|O, λ), which is equivalent to computing argmax_Q P(Q, O|λ)

The Viterbi algorithm (dynamic programming) defines δ_j(t), i.e., the highest probability of a single path of length t which accounts for the observations and ends in state S_j

    δ_j(t) = max_{q_1, q_2, ..., q_{t-1}} P(q_1 q_2 ... q_t = S_j, O_1 O_2 ... O_t | λ)

By induction

    δ_j(1) = π_j b_{j O_1}    1 ≤ j ≤ N

    δ_j(t+1) = [max_i δ_i(t) a_ij] b_{j O_{t+1}}    1 ≤ t ≤ T - 1

With backtracking (keeping the maximizing argument for each t and j) we find the optimal solution
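The recursion plus backtracking can be sketched as follows, again on the invented toy model:

```python
import numpy as np

# Invented toy DHMM
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
O  = [0, 2, 1]

def viterbi(A, B, pi, O):
    """Return the single best state sequence argmax_Q P(Q, O | lambda)."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))      # delta[t, j]: best path probability ending in j
    psi = np.zeros((T, N), int)   # psi[t, j]: maximizing predecessor (backpointer)
    delta[0] = pi * B[:, O[0]]    # delta_j(1) = pi_j b_{j O_1}
    for t in range(1, T):
        scores = delta[t-1][:, None] * A          # scores[i, j] = delta_i(t-1) a_ij
        psi[t] = scores.argmax(axis=0)            # keep the maximizing argument
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    # Backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()

path, best_prob = viterbi(A, B, pi, O)
```

The probability of the single best path is at most the total likelihood P(O|λ), since the latter sums over all paths.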
Backtracking
Figure: Illustration of the backtracking procedure © G.W. Pulford
Estimation of HMM parameters
There is no known way to analytically solve for the model which maximizes the probability of the observation sequence

We can choose λ = (A, B, π) which locally maximizes P(O|λ):
- gradient techniques
- Baum-Welch reestimation (equivalent to EM)

We need to define ξ_ij(t), i.e., the probability of being in state S_i at time t and in state S_j at time t + 1

    ξ_ij(t) = P(q_t = S_i, q_{t+1} = S_j | O, λ)

    ξ_ij(t) = α_i(t) a_ij b_{j O_{t+1}} β_j(t+1) / P(O|λ)
            = α_i(t) a_ij b_{j O_{t+1}} β_j(t+1) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_i(t) a_ij b_{j O_{t+1}} β_j(t+1)
Estimation of HMM parameters
Figure: Operations for computing ξ_ij(t)

Recall that γ_i(t) is the probability of being in state S_i at time t, hence

    γ_i(t) = Σ_{j=1}^{N} ξ_ij(t)

Now if we sum over the time index t:

    Σ_{t=1}^{T-1} γ_i(t) = expected number of times that S_i is visited
                         = expected number of transitions from state S_i

    Σ_{t=1}^{T-1} ξ_ij(t) = expected number of transitions from S_i to S_j
Baum-Welch Reestimation
Reestimation formulas:

    π̄_i = γ_i(1)

    ā_ij = Σ_{t=1}^{T-1} ξ_ij(t) / Σ_{t=1}^{T-1} γ_i(t)

    b̄_jk = Σ_{t: O_t = v_k} γ_j(t) / Σ_{t=1}^{T} γ_j(t)

Baum et al. proved that if the current model is λ = (A, B, π) and we use the above to compute λ̄ = (Ā, B̄, π̄), then either:
- λ̄ = λ – we are in a critical point of the likelihood function, or
- P(O|λ̄) > P(O|λ) – model λ̄ is more likely

If we iteratively reestimate the parameters, we obtain a maximum likelihood estimate of the HMM

Unfortunately this finds a local maximum, and the surface can be very complex
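One full reestimation pass can be sketched end-to-end (forward, backward, γ, ξ, then the three formulas above); the model and observation sequence are invented, and Baum's result guarantees the likelihood does not decrease:

```python
import numpy as np

# Invented toy DHMM and observation sequence
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
O  = [0, 2, 1, 0, 1]
T, N, M = len(O), len(pi), B.shape[1]

def likelihood(A, B, pi, O):
    """P(O|lambda) via the forward recursion."""
    alpha = pi * B[:, O[0]]
    for t in range(1, len(O)):
        alpha = (alpha @ A) * B[:, O[t]]
    return alpha.sum()

def baum_welch_step(A, B, pi, O):
    """One reestimation pass: compute gamma and xi, return (A_bar, B_bar, pi_bar)."""
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t+1]] * beta[t+1])
    P_O = alpha[-1].sum()
    gamma = alpha * beta / P_O
    # xi[t, i, j] = alpha_i(t) a_ij b_{j O_{t+1}} beta_j(t+1) / P(O|lambda)
    xi = np.array([np.outer(alpha[t], B[:, O[t+1]] * beta[t+1]) * A / P_O
                   for t in range(T - 1)])
    pi_bar = gamma[0]
    A_bar = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_bar = np.zeros((N, M))
    for k in range(M):
        # numerator sums gamma only over the times where O_t = v_k
        B_bar[:, k] = gamma[np.array(O) == k].sum(axis=0) / gamma.sum(axis=0)
    return A_bar, B_bar, pi_bar

A2, B2, pi2 = baum_welch_step(A, B, pi, O)
improved = likelihood(A2, B2, pi2, O) >= likelihood(A, B, pi, O)
```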
Non-ergodic HMMs
Until now we have only considered ergodic (fully connected) HMMs, where every state can be reached from any state in a finite number of steps

Figure: Ergodic HMM

The left-right (Bakis) model is good for speech recognition:
- as time increases, the state index increases or stays the same
- it can be extended to parallel left-right models
Figure: Left-right HMM
Figure: Parallel HMM
Gaussian HMM (GMMM)
HMMs can be used with continuous observation densities. We can model such densities with Gaussian mixtures

    b_j(O) = Σ_{m=1}^{M} c_jm N(O; μ_jm, U_jm)
Then the reestimation formulas are still simple
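For a single state j, the mixture density b_j(O) can be evaluated directly; the weights, means, and variances below are invented one-dimensional values:

```python
import numpy as np

# Invented single-state example: M = 2 mixture components in 1-D
c  = np.array([0.3, 0.7])            # mixture weights c_jm, sum to 1
mu = np.array([0.0, 2.0])            # component means mu_jm
U  = np.array([1.0, 0.5])            # component variances U_jm

def gaussian_pdf(x, mean, var):
    """Univariate normal density N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def emission_density(x, c, mu, U):
    """b_j(O) = sum_m c_jm N(O; mu_jm, U_jm) for one state j."""
    return float(np.sum(c * gaussian_pdf(x, mu, U)))

density = emission_density(1.0, c, mu, U)
```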
More fun
Autoregressive HMMs
State Duration Density HMMs
Discriminatively trained HMMs
- maximum mutual information instead of maximum likelihood

HMMs in a similarity measure

Conditional Random Fields can loosely be understood as a generalization of HMMs: constant transition probabilities are replaced with arbitrary functions that vary across the positions in the sequence of hidden states

Figure: Random Oxford fields © R. Tourtelot