PGM 2002/03 Tirgul 1
Hidden Markov Models
Introduction
Hidden Markov Models (HMMs) are one of the most common forms of probabilistic graphical models, although they were developed long before the notion of general graphical models existed (1913). They are used to model time-invariant, limited-horizon processes that have both an underlying mechanism (hidden states) and an observable consequence. They have been extremely successful in language modeling and speech recognition systems and are still the most widely used technique in these domains.
Markov Models

A Markov process or model assumes that we can predict the future based just on the present (or on a limited horizon into the past):

Let {X_1,…,X_T} be a sequence of random variables taking values in {1,…,N}. The Markov properties are:

Limited horizon: P(X_{t+1} = i | X_1,…,X_t) = P(X_{t+1} = i | X_t)

Time invariant (stationary): P(X_{t+1} = i | X_t) = P(X_2 = i | X_1)
Describing a Markov Chain

A Markov chain can be described by the transition matrix A and initial state probabilities Q:

a_{ij} = P(X_{t+1} = j | X_t = i),    q_i = P(X_1 = i)

or alternatively by a state-transition diagram:

[Figure: four-state transition diagram with transition probabilities 0.6, 0.4, 0.3, 0.7, 1.0]

and we calculate:

P(X_1,…,X_T) = P(X_1) P(X_2 | X_1) ⋯ P(X_T | X_{T-1}) = q_{X_1} ∏_{t=1}^{T-1} A_{X_t X_{t+1}}
Hidden Markov Models

In a Hidden Markov Model (HMM) we do not observe the sequence that the model passed through but only some probabilistic function of it. Thus, it is a Markov model with the addition of emission probabilities:

b_{ik} = P(Y_t = k | X_t = i)

For example:
Observed: The House is on fire
States: def noun verb prep noun
Why use HMMs?

• A lot of real-life processes are composed of underlying events generating surface phenomena. Tagging parts of speech is a common example.
• We can usually think of processes as having a limited horizon (we can easily extend to the case of a constant horizon larger than 1).
• We have an efficient training algorithm using EM.
• Once the model is set, we can easily run it (see the sketch below):
  t = 1, start in state i with probability q_i
  forever: move from state i to j with probability a_ij
           emit y_t = k with probability b_ik
           t = t + 1
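The run loop above translates almost directly into code. Here is a minimal sketch in Python/NumPy (my own illustration, not from the tutorial); the names q, A, B follow the slides, and the horizon T is an added argument so the loop terminates. Emission is done from the current state, matching b_ik = P(Y_t = k | X_t = i).

```python
import numpy as np

def sample_hmm(q, A, B, T, rng=None):
    """Sample a state sequence X and observation sequence Y of length T
    from an HMM with initial probabilities q, transition matrix A and
    emission matrix B (B[i, k] = P(Y_t = k | X_t = i))."""
    rng = rng or np.random.default_rng()
    N, K = B.shape
    X, Y = [], []
    i = rng.choice(N, p=q)                  # t = 1: start in state i with probability q_i
    for _ in range(T):
        X.append(i)
        Y.append(rng.choice(K, p=B[i]))     # emit y_t = k with probability b_ik
        i = rng.choice(N, p=A[i])           # move from state i to j with probability a_ij
    return np.array(X), np.array(Y)
```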
The fundamental questions

Likelihood: Given a model λ = (A, B, Q), how do we efficiently compute the likelihood of an observation, P(Y | λ)?

Decoding: Given the observation sequence Y and a model λ, what state sequence explains it best (MPE)? This is, for example, the tagging process of an observed sentence.

Learning: Given an observation sequence Y and a generic model, how do we estimate the parameters that define the best model to describe the data?
Computing the Likelihood

For any state sequence (X_1,…,X_T):

P(Y | X) = ∏_{t=1}^{T} P(y_t | X_t) = b_{X_1 y_1} b_{X_2 y_2} ⋯ b_{X_T y_T}

P(X_1,…,X_T) = q_{X_1} a_{X_1 X_2} a_{X_2 X_3} ⋯ a_{X_{T-1} X_T}

Using P(Y, X) = P(Y | X) P(X) we get:

P(Y) = Σ_X P(Y, X) = Σ_X P(Y | X) P(X) = Σ_{X_1,…,X_T} q_{X_1} ∏_{t=1}^{T} b_{X_t y_t} ∏_{t=1}^{T-1} a_{X_t X_{t+1}}

But we have O(T·N^T) multiplications!
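As an illustration of why this is intractable (my own sketch, not part of the slides), the sum over all N^T state sequences can be coded directly; it only runs for tiny N and T, which is exactly what motivates the trellis algorithm on the next slides.

```python
import itertools
import numpy as np

def likelihood_bruteforce(q, A, B, y):
    """P(Y) by summing P(Y, X) over all N^T state sequences (exponential in T)."""
    N, T = len(q), len(y)
    total = 0.0
    for X in itertools.product(range(N), repeat=T):
        p_x = q[X[0]] * np.prod([A[X[t], X[t + 1]] for t in range(T - 1)])   # P(X)
        p_y_given_x = np.prod([B[X[t], y[t]] for t in range(T)])             # P(Y|X)
        total += p_x * p_y_given_x
    return total
```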
The trellis (lattice) algorithm

To compute the likelihood: need to enumerate over all paths in the lattice (all possible instantiations of X_1…X_T).
But… some starting subpath (blue) is common to many continuing paths (blue+red)
Idea: using dynamic programming, calculate a path in terms of shorter sub-paths
The trellis (lattice) algorithm
We build a matrix of the probability of being in state i at time t: α_t(i) = P(X_t = i, y_1 y_2 … y_{t-1}) is computed as a function of the previous column (forward procedure):

[Figure: one trellis column feeding state j at time t+1, arcs labeled a_{1j} b_{1 y_t}, a_{2j} b_{2 y_t}, …, a_{Nj} b_{N y_t}]

α_1(i) = q_i
α_{t+1}(i) = Σ_{j=1}^{N} α_t(j) a_{ji} b_{j y_t}
P(Y) = Σ_{i=1}^{N} α_{T+1}(i)
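A minimal NumPy sketch of this forward procedure (my own code, not from the tutorial), using the same convention α_t(i) = P(y_1…y_{t-1}, X_t = i); row t of the table holds α_{t+1}:

```python
import numpy as np

def forward(q, A, B, y):
    """Forward procedure. alpha[t, i] = P(y_1..y_t, X_{t+1} = i) for t = 0..T,
    so alpha[0] = q and P(Y) = alpha[T].sum()."""
    N, T = len(q), len(y)
    alpha = np.zeros((T + 1, N))
    alpha[0] = q                                   # alpha_1(i) = q_i
    for t in range(T):
        # alpha_{t+1}(i) = sum_j alpha_t(j) * a_ji * b_{j, y_t}
        alpha[t + 1] = (alpha[t] * B[:, y[t]]) @ A
    return alpha, alpha[T].sum()                   # table and likelihood P(Y)
```

For long sequences the raw entries shrink very quickly (as the casino example later illustrates), so practical implementations rescale each column or work in log space.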
The trellis (lattice) algorithm (cont.)

We can similarly define a backward procedure for filling the matrix:

β_t(i) = P(y_{t+1} … y_T | X_t = i)

β_T(i) = 1
β_t(i) = Σ_{j=1}^{N} a_{ij} b_{j y_{t+1}} β_{t+1}(j)
P(Y) = Σ_{i=1}^{N} q_i b_{i y_1} β_1(i)

And we can easily combine:

P(Y, X_t = i) = P(y_1 … y_T, X_t = i)
            = P(y_1 … y_{t-1}, X_t = i, y_t … y_T)
            = P(y_1 … y_{t-1}, X_t = i) P(y_t … y_T | y_1 … y_{t-1}, X_t = i)
            = P(y_1 … y_{t-1}, X_t = i) P(y_t … y_T | X_t = i)
            = α_t(i) b_{i y_t} β_t(i)
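The backward pass can be sketched the same way (again my own code, with the same conventions as the forward sketch; beta[t-1] stores β_t):

```python
import numpy as np

def backward(q, A, B, y):
    """Backward procedure. beta[t - 1, i] = beta_t(i) = P(y_{t+1}..y_T | X_t = i)."""
    N, T = len(q), len(y)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                               # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_{t+1}(i) = sum_j a_ij * b_{j, y_{t+2}} * beta_{t+2}(j)  (loop variable t is 0-based)
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    return beta

# Combining: P(Y, X_t = i) = alpha_t(i) * b_{i, y_t} * beta_t(i), so for any fixed t
#   P(Y) = sum_i alpha[t - 1, i] * B[i, y[t - 1]] * beta[t - 1, i],
# which agrees with alpha[T].sum() from forward() and with sum_i q_i * b_{i, y_1} * beta_1(i).
```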
Finding the best state sequence

We would like to find the most likely path (and not just the most likely state at each time slice):

argmax_X P(X | Y) = argmax_X P(X, Y)

The Viterbi algorithm is an efficient trellis method for finding the MPE:

δ_1(i) = q_i
δ_{t+1}(i) = max_j δ_t(j) a_{ji} b_{j y_t},    γ_{t+1}(i) = argmax_j δ_t(j) a_{ji} b_{j y_t}

and we reconstruct the path:

P(X̂) = max_i δ_{T+1}(i),    X̂_{T+1} = argmax_i δ_{T+1}(i)
X̂_t = γ_{t+1}(X̂_{t+1})
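A corresponding Viterbi sketch (my own code, same conventions as the forward sketch; raw probabilities are kept so the values match the casino numbers below, so for long sequences one would switch to log space):

```python
import numpy as np

def viterbi(q, A, B, y):
    """Most probable state sequence under the slides' convention
    delta_1(i) = q_i,  delta_{t+1}(i) = max_j delta_t(j) * a_ji * b_{j, y_t}."""
    N, T = len(q), len(y)
    delta = np.zeros((T + 1, N))
    gamma = np.zeros((T + 1, N), dtype=int)        # gamma[t] holds gamma_{t+1}
    delta[0] = q
    for t in range(T):
        scores = delta[t][:, None] * B[:, y[t]][:, None] * A   # scores[j, i]
        delta[t + 1] = scores.max(axis=0)
        gamma[t + 1] = scores.argmax(axis=0)
    # Backtrack: X_hat_{T+1} = argmax_i delta_{T+1}(i), X_hat_t = gamma_{t+1}(X_hat_{t+1}).
    path = np.zeros(T + 1, dtype=int)
    path[T] = delta[T].argmax()
    for t in range(T - 1, -1, -1):
        path[t] = gamma[t + 1][path[t + 1]]
    return path[:T], delta[T].max()                # states X_hat_1..X_hat_T and P(X_hat)
```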
The Casino HMM

A casino switches from a fair die (state F) to a loaded one (state U) with probability 0.05, and the other way around with probability 0.1. Also, with probability 0.01. The casino, honestly, reads off the number that was rolled.
A = | 0.95  0.05 |
    | 0.10  0.90 |

B = | 1/6   1/6   1/6   1/6   1/6   1/6 |
    | 1/10  1/10  1/10  1/10  1/10  1/2 |

P(3151166661, F F F F F U U U U U) = ?
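To make the numbers concrete, here is the model written out in code (my own illustration); the initial probabilities q = (0.95, 0.05) are taken from α_1 on the next slide, and the joint probability of the observation with the fixed path F F F F F U U U U U is just a product of q, a and b terms:

```python
import numpy as np

# State 0 = fair (F), state 1 = loaded (U); symbol k stands for die face k+1.
q = np.array([0.95, 0.05])
A = np.array([[0.95, 0.05],
              [0.10, 0.90]])
B = np.array([[1/6] * 6,
              [1/10] * 5 + [1/2]])

y = np.array([3, 1, 5, 1, 1, 6, 6, 6, 6, 1]) - 1    # observation 3151166661 as 0-based symbols
x = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])        # the path F F F F F U U U U U

joint = q[x[0]] * np.prod(A[x[:-1], x[1:]]) * np.prod(B[x, y])
print(joint)   # P(Y, X) for this particular path
```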
The Casino HMM (cont.)

What is the likelihood of 3151166661?

Y = 3 1 5 1 1 6 6 6 6 1

α_1(1) = 0.95, α_1(2) = 0.05
α_2(1) = 0.95*0.95*1/6 + 0.05*0.1*1/10 = 0.1509
α_2(2) = 0.95*0.05*1/6 + 0.05*0.9*1/10 = 0.0124
α_3(1) = 0.0240, α_3(2) = 0.0025
α_4(1) = 0.0038, α_4(2) = 0.0004
α_5(1) = 0.0006, α_5(2) = 0.0001
… all smaller than 0.0001!
The Casino HMM (cont.)

What explains 3151166661 best?

Y = 3 1 5 1 1 6 6 6 6 1

δ_1(1) = 0.95, δ_1(2) = 0.05
δ_2(1) = max(0.95*0.95*1/6, 0.05*0.1*1/10) = 0.1504
δ_2(2) = max(0.95*0.05*1/6, 0.05*0.9*1/10) = 0.0079
δ_3(1) = 0.0238, δ_3(2) = 0.0013
δ_4(1) = 0.0006, δ_4(2) = 0.0002
δ_5(1) = 0.0001, δ_5(2) = 0.0000…
…
The Casino HMM (cont.)

An example of reconstruction using Viterbi (Durbin), with 0 marking the fair die and 1 the loaded one:
Rolls 3151162464466442453113216311641521336
Die 0000000000000000000000000000000000000
Viterbi 0000000000000000000000000000000000000
Rolls 2514454363165662656666665116645313265
Die 0000000011111111111111111111100000000
Viterbi 0000000000011111111111111111100000000
Rolls 1245636664631636663162326455236266666
Die 0000111111111111111100011111111111111
Viterbi 0000111111111111111111111111111111111
Learning

If we were given both X and Y, we could choose

(A, B, Q) = argmax_{A,B,Q} P(Y_training, X_training | A, B, Q)

Using the Maximum Likelihood principle, we simply set each parameter to the corresponding relative frequency.

What do we do when we have only Y?

(A, B, Q) = argmax_{A,B,Q} P(Y_training | A, B, Q)

ML here does not have a closed-form formula!
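When both the states and the observations are given, the relative-frequency estimates are a few lines of counting. A small sketch (my own, not from the slides), assuming a single fully observed training sequence in which every state occurs:

```python
import numpy as np

def ml_estimate(X, Y, N, K):
    """Relative-frequency (ML) estimates from a fully observed pair of sequences.
    X: states in 0..N-1, Y: observations in 0..K-1 (same length)."""
    q = np.zeros(N)
    A = np.zeros((N, N))
    B = np.zeros((N, K))
    q[X[0]] = 1.0                        # with one sequence, q is the indicator of X_1
    for i, j in zip(X[:-1], X[1:]):
        A[i, j] += 1                     # count transitions i -> j
    for i, k in zip(X, Y):
        B[i, k] += 1                     # count emissions of k from state i
    A /= A.sum(axis=1, keepdims=True)    # normalize counts to relative frequencies
    B /= B.sum(axis=1, keepdims=True)
    return q, A, B
```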
EM (Baum-Welch)

Idea: Use the current guess to complete the data and re-estimate.

Thm: The likelihood of the observables never decreases!!! (to be proved later in the course)

Problem: Gets stuck at sub-optimal solutions.

E-step: "Guess" X using Y and the current parameters.
M-step: Re-estimate the parameters using the current completion of the data.
Parameter Estimation

We define the expected number of transitions from state i to j at time t:

p_t(i, j) = P(X_t = i, X_{t+1} = j | Y)
          = α_t(i) a_{ij} b_{i y_t} b_{j y_{t+1}} β_{t+1}(j) / [ Σ_m Σ_n α_t(m) a_{mn} b_{m y_t} b_{n y_{t+1}} β_{t+1}(n) ]

The expected number of transitions from i to j in Y is then

Σ_{t=1}^{T} p_t(i, j)
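These posteriors can be computed for all t with one forward and one backward pass. A sketch (my own code, same conventions and array layout as the earlier forward/backward sketches; for simplicity it covers the transitions t = 1..T-1 between observed positions):

```python
import numpy as np

def expected_transitions(q, A, B, y):
    """p[t - 1, i, j] = p_t(i, j) = P(X_t = i, X_{t+1} = j | Y) for t = 1..T-1."""
    N, T = len(q), len(y)
    alpha = np.zeros((T + 1, N)); alpha[0] = q                 # forward pass
    for t in range(T):
        alpha[t + 1] = (alpha[t] * B[:, y[t]]) @ A
    beta = np.zeros((T, N)); beta[T - 1] = 1.0                 # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    p = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # alpha_t(i) * b_{i,y_t} * a_ij * b_{j,y_{t+1}} * beta_{t+1}(j), normalized by P(Y)
        num = (alpha[t] * B[:, y[t]])[:, None] * A * (B[:, y[t + 1]] * beta[t + 1])[None, :]
        p[t] = num / num.sum()
    return p
```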
Parameter Estimation (cont.)

We use the EM re-estimation formulas with the expected counts we already have:

q̂_i = expected frequency in state i at time 1 = Σ_j p_1(i, j)

â_{ij} = (expected # of transitions from state i to j) / (expected # of transitions from state i) = Σ_t p_t(i, j) / Σ_t Σ_j p_t(i, j)

b̂_{ik} = (expected # of emissions of k from state i) / (expected # of emissions from state i) = Σ_{t: y_t = k} Σ_j p_t(i, j) / Σ_t Σ_j p_t(i, j)
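Turning the expected counts into new parameters is the M-step; iterating it with the E-step above is Baum-Welch. A sketch (my own code; it takes the p array produced by the expected_transitions sketch, so it re-estimates over the time steps that array covers):

```python
import numpy as np

def reestimate(p, y, K):
    """M-step: new (q, A, B) from expected transition posteriors p[t - 1, i, j] = p_t(i, j)."""
    T1, N, _ = p.shape                     # T1 = number of time steps covered by p
    gamma = p.sum(axis=2)                  # gamma[t - 1, i] = expected count of state i at time t
    q = gamma[0]                                        # expected frequency in state i at time 1
    A = p.sum(axis=0) / gamma.sum(axis=0)[:, None]      # expected i->j / expected out of i
    B = np.zeros((N, K))
    for t in range(T1):
        B[:, y[t]] += gamma[t]             # expected emissions of symbol y_t from each state
    B /= gamma.sum(axis=0)[:, None]
    return q, A, B

# One EM iteration: p = expected_transitions(q, A, B, y); q, A, B = reestimate(p, y, B.shape[1])
# Repeating until the likelihood stops improving is the Baum-Welch loop described above.
```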
Application: Sequence Pair Alignment

DNA sequences are "strings" over a four-letter alphabet {C, G, A, T}. Real-life sequences often differ by way of mutation of a letter or even a deletion. A fundamental problem in computational biology is to align such sequences:

Input:               Output:
CTTTACGTTACTTACG     CTTTACGTTAC-TTACG
CTTAGTTACGTTAG       C-TTA-GTAACGTTA-G
How can we use an HMM for such a task?
Sequence Pair Alignment (cont.)

We construct an HMM with 3 states that emits pairs of letters:

(M)atch: Probability of emission of aligned pairs (high probability for matching letters, low for a mismatch)
(D)elete1: Emission of a letter in the first sequence and an insert in the second sequence
(D)elete2: The converse of D1
Transition matrix A:
      M     D1    D2
M     0.9   0.05  0.05
D1    0.95  0.05  0
D2    0.95  0     0.05

Initial probabilities Q:
M     0.9
D1    0.05
D2    0.05
Matrix B

If in M, emit the same letter with probability 0.24. If in D1 or D2, emit all letters uniformly.
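A sketch of these parameters in code (my own illustration; the mismatch probability is an assumption derived from the 0.24 figure: four identical pairs at 0.24 each leave mass 0.04 to spread over the twelve mismatching pairs):

```python
import numpy as np

letters = "ACGT"
states = ["M", "D1", "D2"]

Q = np.array([0.90, 0.05, 0.05])
A = np.array([[0.90, 0.05, 0.05],     # M  -> M, D1, D2
              [0.95, 0.05, 0.00],     # D1 -> M, D1, D2
              [0.95, 0.00, 0.05]])    # D2 -> M, D1, D2

# M emits a pair of letters: 0.24 for each identical pair, the remaining
# 0.04 spread uniformly over the 12 mismatching pairs (assumed split).
B_match = np.full((4, 4), 0.04 / 12)
np.fill_diagonal(B_match, 0.24)

# D1 / D2 emit a single letter (against a gap in the other sequence) uniformly.
B_delete = np.full(4, 0.25)
```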
Sequence Pair Alignment (cont.)

How do we align 2 new sequences?

We also need (B)egin and (E)nd states to signify the start and end of the sequence.

• From B we will have the same transition probabilities as for M, and we will never return to it.
• We will have a probability of 0.05 to reach E from any state, and we will never leave it. This probability will determine the average length of an alignment.
We now just do Viterbi with some extra technical details (Programming ex1)
Extensions

HMMs have been used so extensively that it is impossible to even start on the many forms they take. Several extensions, however, are worth mentioning:

•using different kinds of transition matrices
•using continuous observations
•using a larger horizon
•assigning probabilities to wait times
•using several training sets
•…