SPR 05 HiddenMarkovModel v1 - unibo.it


Massimo Piccardi, UTS 1

Massimo Piccardi, University of Technology, Sydney, Australia

Short course

A vademecum of statistical pattern recognition techniques with applications to image and video analysis

Lecture 5: The hidden Markov model

Massimo Piccardi, UTS 2

Agenda

• Sequential data
• Hidden Markov model
• The canonical problems: evaluation, decoding and estimation
• The forward and backward procedures
• The Viterbi algorithm
• The Baum-Welch algorithm
• Example papers
• References


Massimo Piccardi, UTS 3

Sequential data

• So far, data from each class were assumed drawn from a single generating distribution, independently of each other (i.i.d. assumption)

• Sequential data, instead, are data sequences, not sets; the order is important

• Each sample may be generated from a different distribution

• Examples: in time: rainfall data; in space: nucleotide pairs in DNA

Massimo Piccardi, UTS 4

Example

• Approximate head position during bending, 9 subjects


Massimo Piccardi, UTS 5

Discrete Markov process

• A model to describe the evolution of a system:
– at any given time, the system can be in one of N possible states S = {s1, s2, …, sN}
– at regularly spaced times, the system undergoes a transition to a new state
– transitions between states can be described probabilistically

• By “discrete”, we mean both time-discrete and with a finite number of states

• (first-order) Markov hypothesis:

$p(q_t = s_j \mid q_{t-1} = s_i, q_{t-2} = s_k, \ldots) = p(q_t = s_j \mid q_{t-1} = s_i)$

Massimo Piccardi, UTS 6

Example

• A simple three-state Markov process of the weather:
1. Rain
2. Cloudy
3. Sunny

• aij = p(qt = sj | qt-1 = si)

• Stationary process

State transition matrix A = {aij} (rows/columns ordered Rain, Cloudy, Sunny):

A = | 0.4  0.3  0.3 |
    | 0.2  0.6  0.2 |
    | 0.1  0.1  0.8 |

[State transition diagram: states 1, 2 and 3, fully connected by the transitions aij]


Massimo Piccardi, UTS 7

Example

Question 1
• Given that the weather on day t = 1 is sunny, what is the probability that the weather for the next 7 days will be "sun, sun, rain, rain, sun, clouds, sun"?
Answer:
p(s3, s3, s1, s1, s3, s2, s3 | q1 = s3, model) = a33 a33 a31 a11 a13 a32 a23 = (0.8)(0.8)(0.1)(0.4)(0.3)(0.1)(0.2) ≈ 1.5 × 10^-4

Question 2
• What is the probability that the weather stays in the same known state si for exactly T consecutive days?
Answer:
$a_{ii}^{T-1}(1 - a_{ii})$

example courtesy of Gutierrez-Osuna; and Rabiner, Juang
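As a quick numerical check of the two answers, a minimal MATLAB sketch (the variable names A and T are introduced here only for illustration):

% A reconstructed from the example above; indices 1 = rain, 2 = cloudy, 3 = sunny
A = [0.4 0.3 0.3; 0.2 0.6 0.2; 0.1 0.1 0.8];
p_q1 = A(3,3)*A(3,3)*A(3,1)*A(1,1)*A(1,3)*A(3,2)*A(2,3)   % Question 1: approx 1.536e-04
T = 3;                                                    % Question 2: e.g. exactly 3 sunny days
p_q2 = A(3,3)^(T-1) * (1 - A(3,3))                        % = 0.8^2 * 0.2 = 0.128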

Massimo Piccardi, UTS 8

Hidden Markov model

• In many cases of interest, the state of a system cannot be directly observed; rather, it must be inferred through measurements of other variables, or observations

• This implies that the state of a system at any given time can be treated as a hidden r.v. and observations as samples

• A hidden Markov model (HMM) is a probabilistic model for a sequence of observations, O = {o1,…,ot,…,oT} and corresponding hidden states, Q = {q1,…,qt,…,qT}


Massimo Piccardi, UTS 9

Hidden Markov model

• Each state, qt, takes value in a finite set of N symbols, S = {s1,s2,…,sN}

• Each observation, ot, takes value in another finite set of M symbols V = {v1,v2,…,vM} (we’ll extend the model to continuous observations later)

• Fundamental hypotheses:

1. Markov state transitions; more precisely, qt given qt-1 is independent of the other previous variables:

$p(q_t \mid q_{t-1}, o_{t-1}, \ldots, q_1, o_1) = p(q_t \mid q_{t-1})$

2. Independence of each observation given its state:

$p(o_t \mid q_T, o_T, \ldots, q_{t+1}, o_{t+1}, q_t, q_{t-1}, o_{t-1}, \ldots, q_1, o_1) = p(o_t \mid q_t)$

Massimo Piccardi, UTS 10

Hidden Markov model

• An HMM can be depicted “rolled-out in time”:

[Rolled-out graphical model: a chain of hidden states Q over times 1, …, t-1, t, t+1, …, T, each state emitting the corresponding observation in O]


Massimo Piccardi, UTS 11

HMM parameters

• A: state transition probabilities, N x N matrix:

$a_{ij} = p(q_t = s_j \mid q_{t-1} = s_i)$

• B: observation probabilities, N x M matrix:

$b_j(k) = p(o_t = v_k \mid q_t = s_j)$

• π: initial state probabilities, N x 1 vector:

$\pi_i = p(q_1 = s_i)$

• Overall: $\lambda = \{A, B, \pi\}$

Massimo Piccardi, UTS 12

Example

• A case with discrete outputs, 2 states {1,2} and 3 output values {R,Y,G}

[State transition diagram: states 1 and 2 with transitions a11, a12, a21, a22; each state emits one of the symbols R, Y, G with probabilities b_i(R), b_i(Y), b_i(G)]

State transition matrix:
A = | 0.95  0.05 |
    | 0.20  0.80 |

Observation matrix (columns ordered R, Y, G):
B = | 0.45  0.10  0.45 |
    | 0.20  0.10  0.70 |

Initial state vector:
π = | 1.00  0.00 |


Massimo Piccardi, UTS 13

Example

• The previous example could be that of a traffic light operating in two possible modes, 'normal' (state 1) and 'rush-hour' (state 2):
– during normal mode, the average duration of the green light is the same as the red one
– in rush-hour mode, the green light lasts longer than the red
– we don't know which mode the traffic light is in at any time; we can just observe its light, say, sampled every 15 minutes
– since the rush hour is shorter, we set a22 smaller than a11

“Why am I always coming from a side street during the rush hour?”

Massimo Piccardi, UTS 14

Fundamental problems

• There are three "canonical" problems for an HMM, each of them with a standard solution:
1. Given O and λ, compute p(O | λ) (evaluation)
2. Given O and λ, find the best state sequence Q which explains O (decoding)
3. Given O, find the λ that maximizes p(O | λ) (estimation)

• The first two are called analysis problems (the model is given), the last is a synthesis one (determine the model)


Massimo Piccardi, UTS 15

Fundamental problems

• Why are we interested in giving a solution to the three fundamental problems?
1. Evaluation: O is basically our sample, and p(O|λ) its probability in the model (O, λ are the equivalent of x, θ for "static" data). Being able to compute p(O|λ) allows us to perform ML/MAP classification
2. Decoding: in certain scenarios, states may have a physical interpretation (e.g. the digits of a phone number), observations are seen as noisy measurements of the state, and state decoding is equivalent to filtering out the noise
3. Estimation: this is equivalent to density estimation; typically, we need to learn a model from a set of samples before we can use it for 1. and 2.

Massimo Piccardi, UTS 16

Classification of an observation sequence as HMM evaluation

• Let us assume that we have N classes of temporal templates (such as Run, Walk, Jump, etc.)

• Let us train one HMM per class: λi, i = 1..N

• Given a new instance, O, it can be classified into the most likely class simply as:

$i^* = \arg\max_{i=1..N} p(O \mid \lambda_i)$

• Priors and costs can be added at will (see the sketch below)
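A minimal sketch of this rule using Murphy's HMM toolbox (introduced later in these slides); the cell array models{}, assumed here to hold the trained parameters of each λi, is introduced only for illustration:

% ML classification of a test sequence O over N trained discrete HMMs
logliks = zeros(1, N);
for i = 1:N
    logliks(i) = dhmm_logprob(O, models{i}.prior, models{i}.transmat, models{i}.obsmat);
end
[~, i_star] = max(logliks);   % most likely class; log-priors could simply be added to logliks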


Massimo Piccardi, UTS 17

Evaluation

• Target: p(O|λ), likelihood of O given λ

• It could be obtained from p(O,Q|λ) by marginalising Q:

$p(O \mid Q, \lambda) = \prod_{t=1}^{T} p(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T)$

$p(Q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$

$p(O \mid \lambda) = \sum_{\forall Q} p(O, Q \mid \lambda) = \sum_{q_1 q_2 \ldots q_T} p(O \mid Q, \lambda)\, p(Q \mid \lambda)$

$\rightarrow\; p(O \mid \lambda) = \sum_{\forall Q = q_1 q_2 \ldots q_T} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T)$

Massimo Piccardi, UTS 18

Evaluation

• Unfortunately, there are N^T different possible assignments for Q!
– for instance, with N = 5 and T = 1000, that is about 10^699

• This makes naïve evaluation impractical. Luckily, the complexity can be reduced from exponential to linear in T

• The algorithm is known as the forward procedure. A backward procedure of equivalent computational cost is also possible


Massimo Piccardi, UTS 19

Forward procedure

• Let us first introduce an auxiliary quantity, αt(j):

$\alpha_t(j) = p(o_1, \ldots, o_t, q_t = s_j \mid \lambda)$

• Initial step:

$\alpha_1(j) = \pi_j\, b_j(o_1)$

• Generic step:

$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad t = 2 \ldots T,\; j = 1 \ldots N$

– at every time step, the number of partial products in αt(j) increases by N and their length increases by 1

• Final step:

$p(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
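A minimal MATLAB sketch of the forward procedure, without the scaling used by practical implementations to avoid underflow; the function name and conventions (A(i,j) = a_ij, B(j,k) = b_j(v_k), obs a vector of observation symbol indices) are assumptions made here:

function [lik, alpha] = hmm_forward(prior, A, B, obs)
% prior: N x 1, A: N x N, B: N x M, obs: 1 x T observation symbol indices
N = size(A, 1); T = length(obs);
alpha = zeros(N, T);
alpha(:,1) = prior(:) .* B(:, obs(1));                 % alpha_1(j) = pi_j b_j(o_1)
for t = 2:T
    alpha(:,t) = (A' * alpha(:,t-1)) .* B(:, obs(t));  % [sum_i alpha_{t-1}(i) a_ij] b_j(o_t)
end
lik = sum(alpha(:,T));                                 % p(O | lambda)
end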

Massimo Piccardi, UTS 20

Example

• An example with N = 3:

[Trellis diagram: three states at t = 1 with initial terms πj bj(o1), connected by the transitions aij to the three states at t = 2 with emission terms bj(o2)]

$\alpha_2(1) = \pi_1 b_1(o_1)\, a_{11}\, b_1(o_2) + \pi_2 b_2(o_1)\, a_{21}\, b_1(o_2) + \pi_3 b_3(o_1)\, a_{31}\, b_1(o_2)$


Massimo Piccardi, UTS 21

Backward procedure

• A "backward" auxiliary quantity, βt(i):

$\beta_t(i) = p(o_{t+1}, \ldots, o_T \mid q_t = s_i, \lambda)$

• Initial step:

$\beta_T(i) = 1$

• Generic step:

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1 \ldots 1,\; i = 1 \ldots N$

• Final step:

$p(O \mid \lambda) = \sum_{j=1}^{N} \pi_j\, b_j(o_1)\, \beta_1(j)$
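The corresponding backward sketch, with the same (assumed) conventions as hmm_forward above; the two passes return the same likelihood:

function [lik, beta] = hmm_backward(prior, A, B, obs)
N = size(A, 1); T = length(obs);
beta = zeros(N, T);
beta(:,T) = 1;                                          % beta_T(i) = 1
for t = T-1:-1:1
    beta(:,t) = A * (B(:, obs(t+1)) .* beta(:,t+1));    % sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
end
lik = sum(prior(:) .* B(:, obs(1)) .* beta(:,1));       % p(O | lambda)
end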

Massimo Piccardi, UTS 22

Decoding

• Recall hypothesis 1: qt given qt-1 is independent of the other previous variables, yet not of the future ones. This means that one needs to use the entire observation sequence, O, to optimally estimate qt

• This is called batch or offline estimation, as opposed to online (estimating qt based on observations up to ot only)

• What we are after is actually the optimal state sequence q1… qT

• The solution is provided by the Viterbi algorithm


Massimo Piccardi, UTS 23

The Viterbi algorithm

• Let us first define an auxiliary quantity, δt(j):

$\delta_t(j) = \max_{q_1, q_2, \ldots, q_{t-1}} p(q_1, q_2, \ldots, q_{t-1}, q_t = s_j, o_1, o_2, \ldots, o_t \mid \lambda)$

that is, the highest probability along any one path over the first t states and observations ending at state sj

• We also introduce the state:

$\psi_t(j) = \arg\max_{i=1..N} \left( \delta_{t-1}(i)\, a_{ij} \right)$

that is, the qt-1 state leading to δt(j)

Massimo Piccardi, UTS 24

The Viterbi algorithm

• Our target is then the state $q_T^* = \arg\max_{i=1..N} \delta_T(i)$ and the sequence of states leading to it, $q_{T-1}^*, \ldots, q_1^*$

• We could evaluate

$q_1^*, \ldots, q_T^* = \arg\max_{q_1, \ldots, q_T} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} \cdots a_{q_{T-1} q_T} b_{q_T}(o_T)$

but this would require, again, N^T evaluations

• Instead, just as we propagate partial sums in the forward procedure, we can propagate partial maxima here, for an O(N^2 T) complexity


Massimo Piccardi, UTS 25

The Viterbi algorithm

• Initial step:

$\delta_1(j) = \pi_j\, b_j(o_1), \quad j = 1 \ldots N$
$\psi_1(j) = \text{N/A}$

• Generic step:

$\delta_t(j) = \max_{i=1..N} \left( \delta_{t-1}(i)\, a_{ij} \right) b_j(o_t), \quad j = 1 \ldots N,\; t = 2 \ldots T$
$\psi_t(j) = \arg\max_{i=1..N} \left( \delta_{t-1}(i)\, a_{ij} \right)$

• Final step:

$p^*_{q_1^*, \ldots, q_T^*} = \max_{i=1..N} \delta_T(i)$
$q_T^* = \arg\max_{i=1..N} \delta_T(i)$
$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \ldots, 1$
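A minimal MATLAB sketch of the three steps above, with the same assumed conventions as hmm_forward (implicit expansion requires MATLAB R2016b or later):

function [q_star, p_star] = hmm_viterbi(prior, A, B, obs)
N = size(A, 1); T = length(obs);
delta = zeros(N, T); psi = zeros(N, T); q_star = zeros(1, T);
delta(:,1) = prior(:) .* B(:, obs(1));                   % delta_1(j) = pi_j b_j(o_1)
for t = 2:T
    [best, psi(:,t)] = max(delta(:,t-1) .* A, [], 1);    % max/argmax over the previous state i
    delta(:,t) = best(:) .* B(:, obs(t));
end
[p_star, q_star(T)] = max(delta(:,T));                   % best final state and path probability
for t = T-1:-1:1
    q_star(t) = psi(q_star(t+1), t+1);                   % backtrack through psi
end
end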

Massimo Piccardi, UTS 26

Example

• An example with N = 3:

[Trellis diagram: three states per time step for t = 1, 2, …, T, with initial terms πj bj(o1), transitions aij, emission terms bj(ot), and max(δT(i)) taken at t = T]

$\delta_2(1) = \max \{ \pi_1 b_1(o_1)\, a_{11}\, b_1(o_2),\; \pi_2 b_2(o_1)\, a_{21}\, b_1(o_2),\; \pi_3 b_3(o_1)\, a_{31}\, b_1(o_2) \}$


Massimo Piccardi, UTS 27

Estimation

• The Baum-Welch algorithm is an EM algorithm for estimating λ given one or more observation sequences

• A full description of the Baum-Welch algorithm is given in [Bilmes98] in two ways:
– based on the forward & backward procedures
– based on the EM algorithm

• Both ways lead to an equivalent solution; in the following we show the iterative re-estimation formulas for A, B, and π

Massimo Piccardi, UTS 28

The Baum-Welch algorithm

• As usual, we first define some auxiliary quantities:

$\gamma_t(i) = p(q_t = s_i \mid O, \lambda)$

which is the probability of being in state si at time t given sequence O (NB: optimal individual state, not optimal state sequence), and:

$\xi_t(i,j) = p(q_t = s_i, q_{t+1} = s_j \mid O, \lambda)$

which is the probability of being in state si at time t and in state sj at time t+1 given sequence O

• γt(i) and ξt(i,j) can be updated at every iteration from αt(i) and βt(i) of the forward and backward procedures


Massimo Piccardi, UTS 29

The Baum-Welch algorithm

• π:

$\bar{\pi}_i = \gamma_1(i)$

• A:

$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

• B:

$\bar{b}_i(k) = \dfrac{\sum_{t=1,\, o_t = v_k}^{T} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$
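A minimal sketch of one re-estimation pass for a single sequence, built on the hmm_forward/hmm_backward sketches above (again without scaling; practical code works in a scaled or log domain and sums over multiple training sequences):

[lik, alpha] = hmm_forward(prior, A, B, obs);
[~,   beta ] = hmm_backward(prior, A, B, obs);
N = size(A, 1); M = size(B, 2); T = length(obs);

gamma = (alpha .* beta) / lik;                        % gamma_t(i) = p(q_t = s_i | O, lambda), N x T
xi = zeros(N, N, T-1);                                % xi_t(i,j) = p(q_t = s_i, q_{t+1} = s_j | O, lambda)
for t = 1:T-1
    xi(:,:,t) = (alpha(:,t) * (B(:, obs(t+1)) .* beta(:,t+1))') .* A / lik;
end

prior_new = gamma(:,1);                                        % pi_i = gamma_1(i)
A_new = sum(xi, 3) ./ sum(gamma(:,1:T-1), 2);                  % a_ij = sum_t xi_t(i,j) / sum_t gamma_t(i)
B_new = zeros(N, M);
for k = 1:M
    B_new(:,k) = sum(gamma(:, obs == k), 2) ./ sum(gamma, 2);  % b_i(k) = sum_{t: o_t = v_k} gamma_t(i) / sum_t gamma_t(i)
end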

Massimo Piccardi, UTS 30

Continuous observations

• In many cases of interest, the observations are continuous r.v.

• A typical model for their distribution is the GMM; thus, in a HMM we have one GMM per state (N GMMstotal)

• The re-estimation formulas are similar to those for a conventional GMM, except that here samples belong to a distribution only through a fractional membership

• Therefore, there are two fractional memberships:
– the membership of sample ot to state si, given by γt(i)
– and the usual responsibility pi(l | ot), i.e. the membership of the fractional sample to the l-th component for state si


Massimo Piccardi, UTS 31

Continuous observations: GMM

$\alpha_{il}^{new} = \dfrac{\sum_{t=1}^{T} \gamma_t(i)\, p(l \mid o_t, \theta_i^{old})}{\sum_{t=1}^{T} \gamma_t(i)}$

$\mu_{il}^{new} = \dfrac{\sum_{t=1}^{T} \gamma_t(i)\, p(l \mid o_t, \theta_i^{old})\, o_t}{\sum_{t=1}^{T} \gamma_t(i)\, p(l \mid o_t, \theta_i^{old})}$

$\Sigma_{il}^{new} = \dfrac{\sum_{t=1}^{T} \gamma_t(i)\, p(l \mid o_t, \theta_i^{old})\, (o_t - \mu_{il}^{new})(o_t - \mu_{il}^{new})^T}{\sum_{t=1}^{T} \gamma_t(i)\, p(l \mid o_t, \theta_i^{old})}$
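A sketch of these updates for one state i, under assumed variable names (obs: d x T observations, gamma_i: 1 x T holding γt(i), and w, mu, Sigma the state's current GMM parameters); mvnpdf is from the Statistics Toolbox:

[d, T] = size(obs); L = length(w);
resp = zeros(L, T);                                   % responsibilities p(l | o_t, theta_i_old)
for l = 1:L
    resp(l,:) = w(l) * mvnpdf(obs', mu(:,l)', Sigma(:,:,l))';
end
resp = resp ./ sum(resp, 1);                          % normalise over the L components
wg = resp .* gamma_i;                                 % gamma_t(i) p(l | o_t, theta_i_old), L x T

w_new  = sum(wg, 2)' / sum(gamma_i);                  % new mixture weights alpha_il
mu_new = (obs * wg') ./ sum(wg, 2)';                  % new means mu_il, one column per component
Sigma_new = zeros(d, d, L);
for l = 1:L
    diff = obs - mu_new(:,l);                         % centred observations, d x T
    Sigma_new(:,:,l) = (diff .* wg(l,:)) * diff' / sum(wg(l,:));   % new covariances Sigma_il
end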

Massimo Piccardi, UTS 32

Sampling an HMM

• An HMM is a generative model that can be sampled (i.e. obtain O)

• Procedure:
– sample q1 from the π distribution
– sample o1 from b_{q1}(·)
– for every t = 2, .., T:
   – sample qt from the {a_{qt-1,1}, …, a_{qt-1,N}} distribution
   – sample ot from b_{qt}(·)
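A minimal MATLAB sketch of this procedure (prior, A, B as in the earlier sketches; T and the helper draw are introduced here only for illustration):

T = 100;                                           % length of the sequence to generate
q = zeros(1, T); o = zeros(1, T);
draw = @(p) find(rand <= cumsum(p(:)), 1);         % sample an index from a discrete distribution p
q(1) = draw(prior);                                % q_1 ~ pi
o(1) = draw(B(q(1), :));                           % o_1 ~ b_{q_1}(.)
for t = 2:T
    q(t) = draw(A(q(t-1), :));                     % q_t ~ {a_{q_{t-1},1}, ..., a_{q_{t-1},N}}
    o(t) = draw(B(q(t), :));                       % o_t ~ b_{q_t}(.)
end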


Massimo Piccardi, UTS 33

Some HMM variations

• HMM with explicit state duration modelling
• Coupled HMM
• Hierarchical HMM
• Layered HMM
• Factorial HMM
• Input-Output HMM
• Dynamic Bayesian networks (DBNs) in general

Massimo Piccardi, UTS 34

Example papers

• Application: action recognition
Recognising actions and interaction in videos

• An early paper on action recognition:Yamato, J., Ohya, J., Ishii, K., “Recognizing human action in time-sequential images using hidden Markov model,” Proceedings of CVPR '92, pp. 379-385

• Interaction (by coupled HMM):
N.M. Oliver, B. Rosario, and A.P. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Trans. on Pattern Anal. and Machine Intell., vol. 22, no. 8, pp. 831-843, 2000 (441 citations, Google Scholar, 14 Nov. 2008)

• An excellent review/taxonomy on action recognition:A. Bobick, “Movement, activity and action: the role of knowledge in the perception of motion,” Philosophical Transactions of the Royal Society B, vol. 352, no. 1358, August 29, 1997, pp. 1257-1265


Massimo Piccardi, UTS 35

Examples in action recognition: an early paper

Yamato, J., Ohya, J., Ishii, K., “Recognizing human action in time-sequential images using hidden Markov model,” Proceedings of CVPR '92, pp. 379-385

• Recognising actions of tennis players by HMMs:
– "forehand stroke"
– "backhand stroke"
– "forehand volley"
– "backhand volley"
– "smash"
– "service"

Massimo Piccardi, UTS 36

The feature vector

• Each frame in the sequence is background-subtracted and binarised, then converted into a mesh of M x N blocks, centered on the centroid and scaled:

• Feature vector: the number of black pixels counted in each block, divided by the block size: f = {a11, a12, ..., aMN}

• Each feature vector is then mapped onto a symbol based on a simple vector quantization (a sketch follows below). The symbol set V is defined by an initial clustering step: 72 symbols in total (12 per action)
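A sketch of the quantization step under assumed details (training_features holds one mesh feature vector per row and f is the current frame's feature vector, both introduced here; kmeans is from the Statistics Toolbox):

K = 72;                                               % total number of symbols
[~, centroids] = kmeans(training_features, K);        % initial clustering step defining V
% at run time, each frame's feature vector maps to the symbol of the nearest centroid
[~, symbol] = min(sum((centroids - f).^2, 2));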


Massimo Piccardi, UTS 37

The HMM

• Therefore, each frame is converted into one symbol; example:

• One HMM per action type; 36 states each (many!)
• Sequence length: 23 to 70; 5 used for training and 5 for testing
• Number of iterations to train each HMM: 100-150
• Accuracy: 96% on 'known' people, 71% on different people

Every fourth frame is displayed; the corresponding symbol is underlined. In this case, the feature extraction worked ideally: all symbols belong to "forehand volley"

Massimo Piccardi, UTS 38

Modeling human interactions:a coupled HMM approach

Oliver, N. M., Rosario, B., Pentland, A. P., “A Bayesian Computer Vision System for Modeling Human Interactions,” PAMI, Vol. 22, No. 8, 2000, pp. 831-843

• Recognising interactions of two people by Coupled HMMs:
– Follow, reach, and walk together (inter1)
– Approach, meet, and go on separately (inter2)
– Approach, meet, and go on together (inter3)
– Change direction in order to meet, approach, meet, and continue together (inter4)
– Change direction in order to meet, approach, meet, and go on separately (inter5)


Massimo Piccardi, UTS 39

The feature vector

• Example trajectories and feature vector for inter3:
[Plots: velocity magnitudes, alignment, relative distance, derivative of relative distance]

Massimo Piccardi, UTS 40

Results with HMMs and CHMMs

• The CHMM:

[Rolled-out graphical model of the CHMM: two coupled state chains Q and Q', each with its own observation chain O and O', shown for times t-3, t-2, t-1, t]

• Results: [not reproduced in this transcript]


Massimo Piccardi, UTS 41

Software

• K. Murphy, Software for graphical models: a review, ISBA Bulletin, 14(4), Dec. 2007:
– contains a recent review of the most popular software packages for graphical models, HMM included
• K. Murphy, Hidden Markov Model (HMM) Toolbox for Matlab, http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
– Matlab code for HMMs for both discrete and GMM observations
• Matlab Statistics Toolbox™ 7.0
– includes functions for discrete HMMs
• GHMM (from the Algorithmics group at the Max Planck Institute for Molecular Genetics), http://www.ghmm.org/
– C library implementing efficient data structures and algorithms for basic and extended HMMs

Massimo Piccardi, UTS 42

An example with Murphy’s toolbox

• This example is slightly modified from Murphy's "dhmm_em_demo"
• We generate a sequence with the transition and observation matrices of our traffic light; π = {1, 0} (we always start from state 1)
• We train a discrete HMM starting from an initial guess where the red and green lights are equiprobable for each state, while the probability of yellow is set to a lower value; the transition probabilities between states are set to the same value
• 1 generated training sequence of length 1000
• 200 iterations


Massimo Piccardi, UTS 43

The traffic light’s example

Q = 2;
O = 3;

% "true" parameters
prior0 = [1 0];
transmat0 = [0.95 0.05; 0.20 0.80];
mk_stochastic(transmat0)
obsmat0 = [0.45 0.10 0.45; 0.20 0.10 0.70];
mk_stochastic(obsmat0)

% training data
T = 1;
nex = 1000;
data = dhmm_sample(prior0, transmat0, obsmat0, T, nex);

% initial guess of parameters
prior1 = [1 0];
transmat1 = [0.8 0.2; 0.2 0.8];
mk_stochastic(transmat1)
obsmat1 = [0.4 0.2 0.4; 0.4 0.2 0.4];
mk_stochastic(obsmat1)

Massimo Piccardi, UTS 44

The traffic light’s example (2)

% improve guess of parameters using EM
[LL, prior2, transmat2, obsmat2] = dhmm_em(data, prior1, transmat1, obsmat1, 'max_iter', 200);

% use model to compute log likelihood
loglik = dhmm_logprob(data, prior2, transmat2, obsmat2)
% log lik is slightly different than LL(end), since it is computed after the final M step

transmat0
obsmat0
transmat2
obsmat2


Massimo Piccardi, UTS 45

The traffic light example: output

loglik =
  -926.2023

transmat0 =
    0.9500    0.0500
    0.2000    0.8000

obsmat0 =
    0.4500    0.1000    0.4500
    0.2000    0.1000    0.7000

transmat2 =
    0.6951    0.3049
    0.0822    0.9178

obsmat2 =
    0.1467    0.0761    0.7772
    0.4705    0.0951    0.4345

• Given the symmetry of the initial guess, the two states have been learned in swapped order

• NB: every run differs!
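Since only the state labels are swapped, a direct comparison with the "true" parameters can be made by permuting them back; a small sketch for this two-state case:

perm = [2 1];                     % swap the two learned states
transmat2(perm, perm)             % now directly comparable with transmat0
obsmat2(perm, :)                  % now directly comparable with obsmat0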

Massimo Piccardi, UTS 46

References

• L.R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989

• B. H. Juang and L. R. Rabiner, “A Probabilistic Distance Measure for Hidden Markov Models,” AT&T Tech Journ., Vol. 64, No. 2, pp. 391-408, February 1985

• Bilmes, J. “A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and Hidden Markov Models,” Tech. Rep. ICSI-TR-97-021, University of California Berkeley, 1998

• Murphy, K., Hidden Markov Model (HMM) Toolbox for Matlab, http://www.ai.mit.edu/~murphyk/Software/HMM/hmm.html