Hidden Markov Models (HMMs)
Steven Salzberg, CMSC 828H, Univ. of Maryland, Fall 2010
1
Hidden Markov Models (HMMs)
2
Definition
• A Hidden Markov Model is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters.
• The challenge is to determine the hidden parameters from the observable parameters.
3
State Transitions
Markov model example:
– x = states of the Markov model
– a = transition probabilities
– b = output probabilities
– y = observable outputs
• How does this differ from a finite state machine?
4
Example
• A distant friend tells you daily about his activities (walk, shop, clean)
• You believe that the weather behaves as a discrete Markov chain (no memory) with two states (rainy, sunny), but you can't observe the states directly. You know the average weather patterns
5
Code
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}
6
Observations
• Given (walk, shop, clean)
– What is the probability of this sequence of observations? (Is he really still at home, or did he skip the country?)
– What was the most likely sequence of rainy/sunny days? (Both questions are worked out in the sketch below.)
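A brute-force sketch of both questions (not from the original slides): it enumerates all 2³ hidden paths using the parameter dicts from the previous slide; path_probability is a helper name I introduce:

from itertools import product

def path_probability(path, obs):
    # Joint probability P(path, obs): start * emission, then transition * emission.
    p = start_probability[path[0]] * emission_probability[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= transition_probability[prev][cur] * emission_probability[cur][o]
    return p

obs = ('walk', 'shop', 'clean')
paths = list(product(states, repeat=len(obs)))

# Probability of the observation sequence: sum over all hidden paths.
print(sum(path_probability(p, obs) for p in paths))        # 0.033612

# Most likely sequence of rainy/sunny days: max over all hidden paths.
print(max(paths, key=lambda p: path_probability(p, obs)))  # ('Sunny', 'Rainy', 'Rainy')

Enumeration is exponential in the sequence length; the forward and Viterbi algorithms later in the deck compute the same two answers in O(TN²) time.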
7
Matrix of trellis scores (rows: observations; columns: current state; each entry is transition prob * emission prob):

                Rainy                   Sunny
walk    start: .6*.1            start: .4*.6
shop    from Rainy: .7*.4       from Rainy: .3*.3
        from Sunny: .4*.4       from Sunny: .6*.3
clean   from Rainy: .7*.5       from Rainy: .3*.1
        from Sunny: .4*.5       from Sunny: .6*.1

Example path: Sunny, Rainy, Rainy = (.4*.6)(.4*.4)(.7*.5) = 0.01344
8
The CpG island problem
• Methylation in human genome
– “CG” -> “TG” happens randomly except where there is selection.
– One area of selection is the “start regions” of genes
– CpG islands = 100-1,000 bases before a gene starts
• Question
– Given a long sequence, how would we find the CpG islands in it?
9
Hidden Markov Model
CpG Island

X = ATTGATGTGAACTGGGGATCGGGCGATATATGATTGG
[Diagram: a middle stretch of X is marked “CpG Island”; the flanking regions are marked “Other”.]

How can we identify a CpG island in a long sequence?
• Idea 1: Test each window of a fixed number of nucleotides
• Idea 2: Classify the whole sequence — label each position O (other) or C (CpG island) and choose among candidate labelings:
  Class label S1: OOOO………….……O
  Class label S2: OOOO…………. OCC…
  Class label Si: OOOO…OCC..CO…O
  Class label SN: CCCC……………….CC

$S^* = \arg\max_S P(S \mid X) = \arg\max_S P(S, X)$
e.g., $S^*$ = OOOO…OCC..CO…O, with the run of C's marking the CpG island
10
An HMM is just one way of modeling p(X,S)…
11
A simple HMM
Parameters
• Initial state prob: p(B) = 0.5, p(I) = 0.5
• State transition prob: p(B→B) = 0.7, p(B→I) = 0.3, p(I→B) = 0.5, p(I→I) = 0.5
• Output prob:
  P(x|H_Other) = p(x|B): P(a|B) = 0.25, P(t|B) = 0.40, P(c|B) = 0.10, P(g|B) = 0.25
  P(x|H_CpG) = p(x|I): P(a|I) = 0.25, P(t|I) = 0.25, P(c|I) = 0.25, P(g|I) = 0.25

[Diagram: two states, B (“other”) and I (CpG island), with self-loops 0.7 (B→B) and 0.5 (I→I), cross transitions 0.3 (B→I) and 0.5 (I→B), initial probabilities P(B) = P(I) = 0.5, and output distributions P(x|B), P(x|I).]
12
A General Definition of HMM

An HMM is a tuple $\lambda = (S, V, \Pi, A, B)$:
• $N$ states: $S = \{s_1, \ldots, s_N\}$
• $M$ symbols: $V = \{v_1, \ldots, v_M\}$
• Initial state probability: $\Pi = \{\pi_1, \ldots, \pi_N\}$, $\sum_{i=1}^{N} \pi_i = 1$;
  $\pi_i$ = prob. of starting at state $s_i$
• State transition probability: $A = \{a_{ij}\}$, $1 \le i, j \le N$, $\sum_{j=1}^{N} a_{ij} = 1$;
  $a_{ij}$ = prob. of going $s_i \to s_j$
• Output probability: $B = \{b_i(v_k)\}$, $1 \le i \le N$, $1 \le k \le M$, $\sum_{k=1}^{M} b_i(v_k) = 1$;
  $b_i(v_k)$ = prob. of generating $v_k$ at $s_i$
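As a concrete rendering of this tuple, a minimal sketch (class and field names are mine, not the slides'):

from dataclasses import dataclass

@dataclass
class HMM:
    states: tuple    # S = {s_1, ..., s_N}, the N states
    symbols: tuple   # V = {v_1, ..., v_M}, the M symbols
    pi: dict         # pi[s]: probability of starting at state s
    A: dict          # A[s][s2]: probability of going s -> s2
    B: dict          # B[s][v]: probability of generating symbol v at state s

    def validate(self):
        # Each distribution must sum to 1, per the constraints above.
        assert abs(sum(self.pi.values()) - 1.0) < 1e-9
        for s in self.states:
            assert abs(sum(self.A[s].values()) - 1.0) < 1e-9
            assert abs(sum(self.B[s].values()) - 1.0) < 1e-9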
13
How to “Generate” a Sequence?
Given a model, follow a path through the states to generate the observation sequence.

[Diagram: the B/I model of slide 11 (P(B) = P(I) = 0.5; B→B = 0.7, B→I = 0.3, I→B = 0.5, I→I = 0.5; outputs P(x|B), P(x|I)) with several candidate state paths, e.g. B I B B I … or I I B B I …, each able to emit the sequence a c g t t …]
14
How to “Generate” a Sequence?
[Diagram: the same B/I model; the particular path B I I I B generating a c g t t.]

$P(\text{“BIIIB”}, \text{“acgtt”}) = p(B)p(a|B) \cdot p(I|B)p(c|I) \cdot p(I|I)p(g|I) \cdot p(I|I)p(t|I) \cdot p(B|I)p(t|B)$
$= (0.5 \times 0.25)(0.3 \times 0.25)(0.5 \times 0.25)(0.5 \times 0.25)(0.5 \times 0.40)$
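A minimal sampling sketch of this generative process (function name and dict layout are mine; random.choices draws one item from a weighted discrete distribution). It encodes the simple B/I model of slide 11 as plain dicts, which the later sketches in this deck reuse:

import random

# The simple B/I HMM of slide 11, as plain dicts.
states = ('B', 'I')
pi = {'B': 0.5, 'I': 0.5}
A = {'B': {'B': 0.7, 'I': 0.3}, 'I': {'B': 0.5, 'I': 0.5}}
B = {'B': {'a': 0.25, 't': 0.40, 'c': 0.10, 'g': 0.25},
     'I': {'a': 0.25, 't': 0.25, 'c': 0.25, 'g': 0.25}}

def generate(T, states, pi, A, B):
    # Sample a start state, then alternate: emit a symbol, move to the next state.
    path, seq = [], []
    s = random.choices(states, weights=[pi[x] for x in states])[0]
    for _ in range(T):
        path.append(s)
        symbols = list(B[s])
        seq.append(random.choices(symbols, weights=[B[s][v] for v in symbols])[0])
        s = random.choices(states, weights=[A[s][x] for x in states])[0]
    return ''.join(path), ''.join(seq)

print(generate(5, states, pi, A, B))   # e.g. ('BIIIB', 'acgtt')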
15
HMM as a Probabilistic Model
Sequential data (random variables/process):
  Time/Index:             t1  t2  t3  t4  …
  Data:                   o1  o2  o3  o4  …
  Observation variable:   O1  O2  O3  O4  …
  Hidden state variable:  S1  S2  S3  S4  …

Joint probability (complete likelihood) — initial state distr., output prob., and state trans. prob. at each step:
$p(O_1, \ldots, O_T, S_1, \ldots, S_T) = p(S_1)\, p(O_1|S_1)\, p(S_2|S_1)\, p(O_2|S_2) \cdots p(S_T|S_{T-1})\, p(O_T|S_T)$

State transition prob:
$p(S_1, \ldots, S_T) = p(S_1)\, p(S_2|S_1) \cdots p(S_T|S_{T-1})$

Probability of observations with known state transitions:
$p(O_1, \ldots, O_T \mid S_1, \ldots, S_T) = p(O_1|S_1)\, p(O_2|S_2) \cdots p(O_T|S_T)$

Probability of observations (incomplete likelihood):
$p(O_1, \ldots, O_T) = \sum_{S_1, \ldots, S_T} p(O_1, \ldots, O_T, S_1, \ldots, S_T)$
16
Three Problems
1. Decoding – finding the most likely path
   Have: model, parameters, observations (data)
   Want: the most likely state sequence
2. Evaluation – computing the observation likelihood
   Have: model, parameters, observations (data)
   Want: the likelihood of generating the observed data
   $p(O \mid \lambda) = \sum_{S_1 \ldots S_T} p(O \mid S_1 \ldots S_T)\, p(S_1 \ldots S_T)$
17
Three Problems (cont.)
3. Training – estimating parameters
   – Supervised
     Have: model, labeled data (data + state sequence)
     Want: parameters
   – Unsupervised
     Have: model, data
     Want: parameters
     $\lambda^* = \arg\max_\lambda p(O \mid \lambda)$
18
Problem I: Decoding/Parsing – Finding the Most Likely Path
You can think of this as classification with all the paths as class labels…
19
What’s the most likely path?
Observed:  a  c  t  t  t  a  g  g
States:    ?  ?  ?  ?  ?  ?  ?  ?

$S_1^* \ldots S_T^* = \arg\max_{S_1 \ldots S_T} p(S_1 \ldots S_T \mid O) = \arg\max_{S_1 \ldots S_T} \pi_{S_1} b_{S_1}(o_1) \prod_{i=2}^{T} a_{S_{i-1} S_i}\, b_{S_i}(o_i)$

[Diagram: the B/I model of slide 11 — P(B) = P(I) = 0.5; transitions B→B = 0.7, B→I = 0.3, I→B = 0.5, I→I = 0.5; output distributions P(x|B) and P(x|I).]
20
Viterbi Algorithm: An Example
[Trellis: states B and I unrolled over the observations a c g t …; between consecutive columns, B→B = 0.7, B→I = 0.3, I→B = 0.5, I→I = 0.5.]

Model: P(B) = P(I) = 0.5; here P(a|B) = 0.251, P(c|B) = 0.098, P(g|B) = 0.251, P(t|B) = 0.40; P(a|I) = P(c|I) = P(g|I) = P(t|I) = 0.25.

t = 1:  VP(B) = 0.5*0.251                  (path B)
        VP(I) = 0.5*0.25                   (path I)
t = 2:  VP(B) = (0.5*0.251)*0.7*0.098      (path BB)
        VP(I) = (0.5*0.25)*0.5*0.25        (path II)
…

Remember the best paths so far.
21
Viterbi Algorithm
Observation:
$\max_{S_1 \ldots S_T} p(o_1 \ldots o_T, S_1 \ldots S_T) = \max_{s_i} \big[ \max_{S_1 \ldots S_{T-1}} p(o_1 \ldots o_T, S_1 \ldots S_{T-1}, S_T = s_i) \big]$

Algorithm (dynamic programming): define
$VP_t(i) = \max_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i)$
$q_t(i) = \big[\arg\max_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i)\big] \to s_i$ (the best path ending at $s_i$)

1. $VP_1(i) = \pi_i\, b_i(o_1)$, $q_1(i) = (s_i)$, for $i = 1, \ldots, N$
2. For $1 \le t < T$:
   $VP_{t+1}(i) = \max_{1 \le j \le N} VP_t(j)\, a_{ji}\, b_i(o_{t+1})$
   $q_{t+1}(i) = q_t(k) \to s_i$, where $k = \arg\max_{1 \le j \le N} VP_t(j)\, a_{ji}\, b_i(o_{t+1})$, for $i = 1, \ldots, N$

The best path is $q_T(i^*)$ with $i^* = \arg\max_i VP_T(i)$.   Complexity: $O(TN^2)$
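A sketch of this recursion in Python (variable names are mine), assuming the B/I parameter dicts defined after slide 14 are in scope; instead of carrying whole paths $q_t(i)$ it stores one back-pointer per state and time step:

def viterbi(obs, states, pi, A, B):
    # VP[t][i]: probability of the best path that ends in state i after t symbols.
    VP = [{s: pi[s] * B[s][obs[0]] for s in states}]       # step 1: VP_1(i)
    back = [{}]
    for o in obs[1:]:                                      # step 2: VP_{t+1}(i)
        prev, col, ptr = VP[-1], {}, {}
        for i in states:
            k = max(states, key=lambda j: prev[j] * A[j][i])   # best predecessor
            col[i] = prev[k] * A[k][i] * B[i][o]
            ptr[i] = k
        VP.append(col)
        back.append(ptr)
    # Trace the back-pointers from the best final state; O(TN^2) overall.
    best = max(states, key=lambda s: VP[-1][s])
    path = [best]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path)), VP[-1][best]

print(viterbi('acgt', states, pi, A, B))   # (['I', 'I', 'B', 'B'], 0.000546875)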
22
Problem II: Evaluation – Computing the Data Likelihood
• Another use of an HMM, e.g., as a generative model for discrimination
• Also related to Problem III – parameter estimation
23
Data Likelihood: p(O|λ)
(" ..." | ) (" ..." | ... ) ( ... )
(" ..." | ... ) ( ... )
... (" ..." | ... ) ( ... )
p a c g t p a c g t BB B p BB B
p a c g t BT B p BT B
p a c g t TT T p TT T
λ =++ +
B
I II I
a c …tg0.5
0.3
0.5
0.5
0.7
BBB0.7 0.7
0.5 0.5
0.5
0.5
0.3
0.5
0.3
0.5
t = 1 2 3 4 …
In general, 1 2
1 2 1 2...
( | ) ( | ... ) ( ... )T
T TS S S
p O p O S S S p S S Sλ = ∑
All HMM parameters
24
The Forward Algorithm
Observation:
$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \sum_{S_1 \ldots S_{T-1}} p(o_1 \ldots o_T, S_1 \ldots S_{T-1}, S_T = s_i)$

Define $\alpha_t(i) = \sum_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i)$: the probability of generating $o_1 \ldots o_t$ with ending state $s_i$.

Since $p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i) = p(o_1 \ldots o_{t-1}, S_1 \ldots S_{t-1})\, p(S_t = s_i \mid S_{t-1})\, p(o_t \mid S_t = s_i)$, summing over paths gives the recursion.

Algorithm:
$\alpha_1(i) = \pi_i\, b_i(o_1)$
$\alpha_t(i) = b_i(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}$

The data likelihood is $p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.   Complexity: $O(TN^2)$
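Before the worked example on the next slide, here is the recursion as a short sketch (same assumed parameter dicts as the earlier sketches); only the current column of α values is kept:

def forward(obs, states, pi, A, B):
    # alpha[i] = p(o_1 ... o_t, S_t = s_i); start with alpha_1(i) = pi_i * b_i(o_1).
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        # alpha_t(i) = b_i(o_t) * sum_j alpha_{t-1}(j) * a_ji
        alpha = {i: B[i][o] * sum(alpha[j] * A[j][i] for j in states)
                 for i in states}
    return sum(alpha.values())    # p(o_1 ... o_T | lambda) = sum_i alpha_T(i)

print(forward('acgt', states, pi, A, B))   # 0.0034225 = alpha_4(B) + alpha_4(I)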
25
Forward Algorithm: Example
[Trellis: B and I unrolled over a c g t.]

α1(B): 0.5 * p(“a”|B)
α1(I): 0.5 * p(“a”|I)
α2(B): [α1(B)*0.7 + α1(I)*0.5] * p(“c”|B)   …
α2(I): [α1(B)*0.3 + α1(I)*0.5] * p(“c”|I)   …

Since $\alpha_t(i) = b_i(o_t) \sum_j \alpha_{t-1}(j)\, a_{ji}$ and $p(o_1 \ldots o_T \mid \lambda) = \sum_i \alpha_T(i)$:
P(“a c g t”) = α4(B) + α4(I)
26
The Backward Algorithm
Observation:
$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \sum_{S_2 \ldots S_T} p(o_1 \ldots o_T, S_1 = s_i, S_2 \ldots S_T) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, p(o_2 \ldots o_T \mid S_1 = s_i)$

Define $\beta_t(i) = \sum_{S_{t+1} \ldots S_T} p(o_{t+1} \ldots o_T, S_{t+1} \ldots S_T \mid S_t = s_i)$: starting from state $s_i$ (with $o_1 \ldots o_t$ already generated), the probability of generating $o_{t+1} \ldots o_T$.

Since $p(o_{t+1} \ldots o_T, S_{t+1} \ldots S_T \mid S_t = s_i) = p(S_{t+1} \mid S_t = s_i)\, p(o_{t+1} \mid S_{t+1})\, p(o_{t+2} \ldots o_T, S_{t+2} \ldots S_T \mid S_{t+1})$, summing over paths gives the recursion.

Algorithm:
$\beta_T(i) = 1$
$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$

The data likelihood is $p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$.   Complexity: $O(TN^2)$
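The mirror-image sketch (same assumed parameter dicts as above); it agrees with the forward value, which is a handy correctness check:

def backward(obs, states, pi, A, B):
    beta = {s: 1.0 for s in states}               # beta_T(i) = 1
    for o in reversed(obs[1:]):
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta = {i: sum(A[i][j] * B[j][o] * beta[j] for j in states)
                for i in states}
    # p(o_1 ... o_T | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[s] * B[s][obs[0]] * beta[s] for s in states)

print(backward('acgt', states, pi, A, B))   # 0.0034225, same as forward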
27
Backward Algorithm: Example
[Trellis: B and I unrolled over a c g t.]

β4(B): 1
β4(I): 1
β3(B): 0.7*p(“t”|B)*β4(B) + 0.3*p(“t”|I)*β4(I)
β3(I): 0.5*p(“t”|B)*β4(B) + 0.5*p(“t”|I)*β4(I)
…

Since $p(o_1 \ldots o_T \mid \lambda) = \sum_i \alpha_t(i)\, \beta_t(i)$ for any $t$:
P(“a c g t”) = α1(B)*β1(B) + α1(I)*β1(I) = α2(B)*β2(B) + α2(I)*β2(I)
28
Problem III: Training – Estimating Parameters
Where do we get the probability values for all parameters?
Supervised vs. Unsupervised
29
Supervised Training
Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
4. State transitions, e.g., S = 1121122222

Task: estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

π1 = 1/1 = 1; π2 = 0/1 = 0
a11 = 2/4 = 0.5; a12 = 2/4 = 0.5; a21 = 1/5 = 0.2; a22 = 4/5 = 0.8
b1(a) = 4/4 = 1.0; b1(b) = 0/4 = 0; b2(a) = 1/6 = 0.167; b2(b) = 5/6 = 0.833

[Diagram: the estimated two-state model — P(s1) = 1, P(s2) = 0; a11 = 0.5, a12 = 0.5, a21 = 0.2, a22 = 0.8; P(a|s1) = 1, P(b|s1) = 0; P(a|s2) = 0.167, P(b|s2) = 0.833.]
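Since the estimates are just relative frequencies, a small counting sketch (helper names are mine) reproduces the numbers above:

from collections import Counter, defaultdict

def supervised_estimate(O, S):
    # Relative-frequency estimates from one labeled (observations, states) pair.
    pi = Counter({S[0]: 1})
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for s, o in zip(S, O):
        emit[s][o] += 1                 # state s emitted symbol o
    for s, s_next in zip(S, S[1:]):
        trans[s][s_next] += 1           # transition s -> s_next
    norm = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return (norm(pi),
            {s: norm(c) for s, c in trans.items()},
            {s: norm(c) for s, c in emit.items()})

print(supervised_estimate('aaaaabbbbb', '1121122222'))
# pi_1 = 1.0; a11 = a12 = 0.5, a21 = 0.2, a22 = 0.8;
# b1(a) = 1.0, b2(a) = 0.1667, b2(b) = 0.8333 — matching the slide.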
30
Unsupervised Training
Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
(This time the state sequence S is NOT given.)

Task: estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

How could this be possible?
Maximum likelihood: $\lambda^* = \arg\max_\lambda p(O \mid \lambda)$
31
Intuition
Treat the state sequence as missing data: enumerate all possible paths q1, …, qK and weight the counts from each path by how well it explains the observations, P(O, qk | λ).

O = aaaaabbbbb
q1 = 1111111111 → P(O, q1|λ)
q2 = 1111112211 → P(O, q2|λ)
…
qK = 2222222222 → P(O, qK|λ)

New λ′:
$\pi_i = \frac{\sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(1) = i]}{\sum_{k=1}^{K} p(O, q_k \mid \lambda)}$

$a_{ij} = \frac{\sum_{t=1}^{T-1} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(t) = i,\ q_k(t+1) = j]}{\sum_{t=1}^{T-1} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(t) = i]}$

$b_i(v_j) = \frac{\sum_{t=1}^{T} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(t) = i,\ o_t = v_j]}{\sum_{t=1}^{T} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(t) = i]}$

But computing P(O, qk|λ) for every path is expensive — there are $N^T$ of them…
32
Baum-Welch Algorithm
Basic “counters”:
$\gamma_t(i) = p(q_t = s_i \mid O, \lambda)$ — being at state $s_i$ at time $t$
$\xi_t(i, j) = p(q_t = s_i, q_{t+1} = s_j \mid O, \lambda)$ — being at state $s_i$ at time $t$ and at state $s_j$ at time $t+1$

Computation of counters (from the forward and backward values):
$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$

$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$, with $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$

Complexity: $O(N^2)$ per time step.
33
Baum-Welch Algorithm (cont.)
Updating formulas:
$\pi_i' = \gamma_1(i)$

$a_{ij}' = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i, j')}$

$b_i'(v_k) = \frac{\sum_{t=1}^{T} \gamma_t(i)\, \delta[o_t = v_k]}{\sum_{t=1}^{T} \gamma_t(i)}$

Overall complexity for each iteration: $O(TN^2)$
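Putting the counters and updating formulas together, one re-estimation step can be sketched as follows (function and variable names are mine; the parameter-dict layout matches the earlier sketches). Iterating the step never decreases $p(O \mid \lambda)$:

def baum_welch_step(obs, states, pi, A, B):
    T = len(obs)
    # Full forward (al) and backward (be) tables, one dict per time step.
    al = [{s: pi[s] * B[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = al[-1]
        al.append({i: B[i][o] * sum(prev[j] * A[j][i] for j in states)
                   for i in states})
    be = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        nxt = be[0]
        be.insert(0, {i: sum(A[i][j] * B[j][o] * nxt[j] for j in states)
                      for i in states})
    pO = sum(al[-1][s] for s in states)            # data likelihood p(O | lambda)
    # Counters: gamma_t(i) and xi_t(i, j).
    gamma = [{i: al[t][i] * be[t][i] / pO for i in states} for t in range(T)]
    xi = [{(i, j): al[t][i] * A[i][j] * B[j][obs[t + 1]] * be[t + 1][j] / pO
           for i in states for j in states} for t in range(T - 1)]
    # Updating formulas.
    new_pi = {i: gamma[0][i] for i in states}
    new_A = {i: {j: sum(x[i, j] for x in xi) / sum(g[i] for g in gamma[:-1])
                 for j in states} for i in states}
    new_B = {i: {v: sum(g[i] for g, o in zip(gamma, obs) if o == v) /
                    sum(g[i] for g in gamma) for v in B[i]} for i in states}
    return new_pi, new_A, new_B, pO

# e.g., pi, A, B, pO = baum_welch_step('acgtt', states, pi, A, B)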
34
Next Time
• Tutorial
• Posted on Blackboard