Hidden Markov Models (HMMs)
Steven Salzberg, CMSC 828H, Univ. of Maryland, Fall 2010
1
Hidden Markov Models (HMMs)
2
Definition
• A Hidden Markov Model is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters.
• The challenge is to determine the hidden parameters from the observable parameters.
3
State Transitions
Markov model example:
– x = states of the Markov model
– a = transition probabilities
– b = output probabilities
– y = observable outputs
• How does this differ from a finite state machine?
4
Example
• A distant friend tells you daily about his activities (walk, shop, clean)
• You believe that the weather behaves as a discrete Markov chain (no memory) with two states (rainy, sunny), but you can't observe the states directly. You know the average weather patterns
5
Code
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}
6
Observations
• Given (walk, shop, clean)
– What is the probability of this sequence of observations? (Is he really still at home, or did he skip the country?)
– What was the most likely sequence of rainy/sunny days? (Both questions are worked out in the sketch below.)
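A brute-force sketch of both questions (not from the original slides): it enumerates all 2³ hidden paths using the parameter dicts from the previous slide; path_probability is a helper name I introduce:

from itertools import product

def path_probability(path, obs):
    # Joint probability P(path, obs): start * emission, then transition * emission.
    p = start_probability[path[0]] * emission_probability[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= transition_probability[prev][cur] * emission_probability[cur][o]
    return p

obs = ('walk', 'shop', 'clean')
paths = list(product(states, repeat=len(obs)))

# Probability of the observation sequence: sum over all hidden paths.
print(sum(path_probability(p, obs) for p in paths))        # 0.033612

# Most likely sequence of rainy/sunny days: max over all hidden paths.
print(max(paths, key=lambda p: path_probability(p, obs)))  # ('Sunny', 'Rainy', 'Rainy')

Enumeration is exponential in the sequence length; the forward and Viterbi algorithms later in the deck compute the same two answers in O(TN²) time.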
7
Matrix of trellis scores (rows: observations; columns: current state; each entry is transition prob * emission prob):

                Rainy                   Sunny
walk    start: .6*.1            start: .4*.6
shop    from Rainy: .7*.4       from Rainy: .3*.3
        from Sunny: .4*.4       from Sunny: .6*.3
clean   from Rainy: .7*.5       from Rainy: .3*.1
        from Sunny: .4*.5       from Sunny: .6*.1

Example path: Sunny, Rainy, Rainy = (.4*.6)(.4*.4)(.7*.5) = 0.01344
8
The CpG island problem
• Methylation in human genome
– “CG” -> “TG” happens randomly except where there is selection.
– One area of selection is the “start regions” of genes
– CpG islands = 100-1,000 bases before a gene starts
• Question
– Given a long sequence, how would we find the CpG islands in it?
9
Hidden Markov Model
CpG Island

X = ATTGATGTGAACTGGGGATCGGGCGATATATGATTGG
[Diagram: a middle stretch of X is marked “CpG Island”; the flanking regions are marked “Other”.]

How can we identify a CpG island in a long sequence?
• Idea 1: Test each window of a fixed number of nucleotides
• Idea 2: Classify the whole sequence — label each position O (other) or C (CpG island) and choose among candidate labelings:
  Class label S1: OOOO………….……O
  Class label S2: OOOO…………. OCC…
  Class label Si: OOOO…OCC..CO…O
  Class label SN: CCCC……………….CC

$S^* = \arg\max_S P(S \mid X) = \arg\max_S P(S, X)$
e.g., $S^*$ = OOOO…OCC..CO…O, with the run of C's marking the CpG island
10
An HMM is just one way of modeling p(X,S)…
11
A simple HMM
Parameters
• Initial state prob: p(B) = 0.5, p(I) = 0.5
• State transition prob: p(B→B) = 0.7, p(B→I) = 0.3, p(I→B) = 0.5, p(I→I) = 0.5
• Output prob:
  P(x|H_Other) = p(x|B): P(a|B) = 0.25, P(t|B) = 0.40, P(c|B) = 0.10, P(g|B) = 0.25
  P(x|H_CpG) = p(x|I): P(a|I) = 0.25, P(t|I) = 0.25, P(c|I) = 0.25, P(g|I) = 0.25

[Diagram: two states, B (“other”) and I (CpG island), with self-loops 0.7 (B→B) and 0.5 (I→I), cross transitions 0.3 (B→I) and 0.5 (I→B), initial probabilities P(B) = P(I) = 0.5, and output distributions P(x|B), P(x|I).]
12
A General Definition of HMM

An HMM is a tuple $\lambda = (S, V, \Pi, A, B)$:
• $N$ states: $S = \{s_1, \ldots, s_N\}$
• $M$ symbols: $V = \{v_1, \ldots, v_M\}$
• Initial state probability: $\Pi = \{\pi_1, \ldots, \pi_N\}$, $\sum_{i=1}^{N} \pi_i = 1$;
  $\pi_i$ = prob. of starting at state $s_i$
• State transition probability: $A = \{a_{ij}\}$, $1 \le i, j \le N$, $\sum_{j=1}^{N} a_{ij} = 1$;
  $a_{ij}$ = prob. of going $s_i \to s_j$
• Output probability: $B = \{b_i(v_k)\}$, $1 \le i \le N$, $1 \le k \le M$, $\sum_{k=1}^{M} b_i(v_k) = 1$;
  $b_i(v_k)$ = prob. of generating $v_k$ at $s_i$
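As a concrete rendering of this tuple, a minimal sketch (class and field names are mine, not the slides'):

from dataclasses import dataclass

@dataclass
class HMM:
    states: tuple    # S = {s_1, ..., s_N}, the N states
    symbols: tuple   # V = {v_1, ..., v_M}, the M symbols
    pi: dict         # pi[s]: probability of starting at state s
    A: dict          # A[s][s2]: probability of going s -> s2
    B: dict          # B[s][v]: probability of generating symbol v at state s

    def validate(self):
        # Each distribution must sum to 1, per the constraints above.
        assert abs(sum(self.pi.values()) - 1.0) < 1e-9
        for s in self.states:
            assert abs(sum(self.A[s].values()) - 1.0) < 1e-9
            assert abs(sum(self.B[s].values()) - 1.0) < 1e-9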
13
How to “Generate” a Sequence?
Given a model, follow a path through the states to generate the observation sequence.

[Diagram: the B/I model of slide 11 (P(B) = P(I) = 0.5; B→B = 0.7, B→I = 0.3, I→B = 0.5, I→I = 0.5; outputs P(x|B), P(x|I)) with several candidate state paths, e.g. B I B B I … or I I B B I …, each able to emit the sequence a c g t t …]
14
How to “Generate” a Sequence?
[Diagram: the same B/I model; the particular path B I I I B generating a c g t t.]

$P(\text{“BIIIB”}, \text{“acgtt”}) = p(B)p(a|B) \cdot p(I|B)p(c|I) \cdot p(I|I)p(g|I) \cdot p(I|I)p(t|I) \cdot p(B|I)p(t|B)$
$= (0.5 \times 0.25)(0.3 \times 0.25)(0.5 \times 0.25)(0.5 \times 0.25)(0.5 \times 0.40)$
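A minimal sampling sketch of this generative process (function name and dict layout are mine; random.choices draws one item from a weighted discrete distribution). It encodes the simple B/I model of slide 11 as plain dicts, which the later sketches in this deck reuse:

import random

# The simple B/I HMM of slide 11, as plain dicts.
states = ('B', 'I')
pi = {'B': 0.5, 'I': 0.5}
A = {'B': {'B': 0.7, 'I': 0.3}, 'I': {'B': 0.5, 'I': 0.5}}
B = {'B': {'a': 0.25, 't': 0.40, 'c': 0.10, 'g': 0.25},
     'I': {'a': 0.25, 't': 0.25, 'c': 0.25, 'g': 0.25}}

def generate(T, states, pi, A, B):
    # Sample a start state, then alternate: emit a symbol, move to the next state.
    path, seq = [], []
    s = random.choices(states, weights=[pi[x] for x in states])[0]
    for _ in range(T):
        path.append(s)
        symbols = list(B[s])
        seq.append(random.choices(symbols, weights=[B[s][v] for v in symbols])[0])
        s = random.choices(states, weights=[A[s][x] for x in states])[0]
    return ''.join(path), ''.join(seq)

print(generate(5, states, pi, A, B))   # e.g. ('BIIIB', 'acgtt')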
15
HMM as a Probabilistic Model
Sequential data (random variables/process):
  Time/Index:             t1  t2  t3  t4  …
  Data:                   o1  o2  o3  o4  …
  Observation variable:   O1  O2  O3  O4  …
  Hidden state variable:  S1  S2  S3  S4  …

Joint probability (complete likelihood) — initial state distr., output prob., and state trans. prob. at each step:
$p(O_1, \ldots, O_T, S_1, \ldots, S_T) = p(S_1)\, p(O_1|S_1)\, p(S_2|S_1)\, p(O_2|S_2) \cdots p(S_T|S_{T-1})\, p(O_T|S_T)$

State transition prob:
$p(S_1, \ldots, S_T) = p(S_1)\, p(S_2|S_1) \cdots p(S_T|S_{T-1})$

Probability of observations with known state transitions:
$p(O_1, \ldots, O_T \mid S_1, \ldots, S_T) = p(O_1|S_1)\, p(O_2|S_2) \cdots p(O_T|S_T)$

Probability of observations (incomplete likelihood):
$p(O_1, \ldots, O_T) = \sum_{S_1, \ldots, S_T} p(O_1, \ldots, O_T, S_1, \ldots, S_T)$
16
Three Problems
1. Decoding – finding the most likely path
   Have: model, parameters, observations (data)
   Want: the most likely state sequence
2. Evaluation – computing the observation likelihood
   Have: model, parameters, observations (data)
   Want: the likelihood of generating the observed data
   $p(O \mid \lambda) = \sum_{S_1 \ldots S_T} p(O \mid S_1 \ldots S_T)\, p(S_1 \ldots S_T)$
17
Three Problems (cont.)
3. Training – estimating parameters
   – Supervised
     Have: model, labeled data (data + state sequence)
     Want: parameters
   – Unsupervised
     Have: model, data
     Want: parameters
     $\lambda^* = \arg\max_\lambda p(O \mid \lambda)$
18
Problem I: Decoding/Parsing – Finding the Most Likely Path
You can think of this as classification with all the paths as class labels…
19
What’s the most likely path?
Observed:  a  c  t  t  t  a  g  g
States:    ?  ?  ?  ?  ?  ?  ?  ?

$S_1^* \ldots S_T^* = \arg\max_{S_1 \ldots S_T} p(S_1 \ldots S_T \mid O) = \arg\max_{S_1 \ldots S_T} \pi_{S_1} b_{S_1}(o_1) \prod_{i=2}^{T} a_{S_{i-1} S_i}\, b_{S_i}(o_i)$

[Diagram: the B/I model of slide 11 — P(B) = P(I) = 0.5; transitions B→B = 0.7, B→I = 0.3, I→B = 0.5, I→I = 0.5; output distributions P(x|B) and P(x|I).]
20
Viterbi Algorithm: An Example
[Trellis: states B and I unrolled over the observations a c g t …; between consecutive columns, B→B = 0.7, B→I = 0.3, I→B = 0.5, I→I = 0.5.]

Model: P(B) = P(I) = 0.5; here P(a|B) = 0.251, P(c|B) = 0.098, P(g|B) = 0.251, P(t|B) = 0.40; P(a|I) = P(c|I) = P(g|I) = P(t|I) = 0.25.

t = 1:  VP(B) = 0.5*0.251                  (path B)
        VP(I) = 0.5*0.25                   (path I)
t = 2:  VP(B) = (0.5*0.251)*0.7*0.098      (path BB)
        VP(I) = (0.5*0.25)*0.5*0.25        (path II)
…

Remember the best paths so far.
21
Viterbi Algorithm
Observation:
$\max_{S_1 \ldots S_T} p(o_1 \ldots o_T, S_1 \ldots S_T) = \max_{s_i} \big[ \max_{S_1 \ldots S_{T-1}} p(o_1 \ldots o_T, S_1 \ldots S_{T-1}, S_T = s_i) \big]$

Algorithm (dynamic programming): define
$VP_t(i) = \max_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i)$
$q_t(i) = \big[\arg\max_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i)\big] \to s_i$ (the best path ending at $s_i$)

1. $VP_1(i) = \pi_i\, b_i(o_1)$, $q_1(i) = (s_i)$, for $i = 1, \ldots, N$
2. For $1 \le t < T$:
   $VP_{t+1}(i) = \max_{1 \le j \le N} VP_t(j)\, a_{ji}\, b_i(o_{t+1})$
   $q_{t+1}(i) = q_t(k) \to s_i$, where $k = \arg\max_{1 \le j \le N} VP_t(j)\, a_{ji}\, b_i(o_{t+1})$, for $i = 1, \ldots, N$

The best path is $q_T(i^*)$ with $i^* = \arg\max_i VP_T(i)$.   Complexity: $O(TN^2)$
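A sketch of this recursion in Python (variable names are mine), assuming the B/I parameter dicts defined after slide 14 are in scope; instead of carrying whole paths $q_t(i)$ it stores one back-pointer per state and time step:

def viterbi(obs, states, pi, A, B):
    # VP[t][i]: probability of the best path that ends in state i after t symbols.
    VP = [{s: pi[s] * B[s][obs[0]] for s in states}]       # step 1: VP_1(i)
    back = [{}]
    for o in obs[1:]:                                      # step 2: VP_{t+1}(i)
        prev, col, ptr = VP[-1], {}, {}
        for i in states:
            k = max(states, key=lambda j: prev[j] * A[j][i])   # best predecessor
            col[i] = prev[k] * A[k][i] * B[i][o]
            ptr[i] = k
        VP.append(col)
        back.append(ptr)
    # Trace the back-pointers from the best final state; O(TN^2) overall.
    best = max(states, key=lambda s: VP[-1][s])
    path = [best]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path)), VP[-1][best]

print(viterbi('acgt', states, pi, A, B))   # (['I', 'I', 'B', 'B'], 0.000546875)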
22
Problem II: Evaluation – Computing the Data Likelihood
• Another use of an HMM, e.g., as a generative model for discrimination
• Also related to Problem III – parameter estimation
23
Data Likelihood: p(O|λ)
(" ..." | ) (" ..." | ... ) ( ... )
(" ..." | ... ) ( ... )
... (" ..." | ... ) ( ... )
p a c g t p a c g t BB B p BB B
p a c g t BT B p BT B
p a c g t TT T p TT T
λ =++ +
B
I II I
a c …tg0.5
0.3
0.5
0.5
0.7
BBB0.7 0.7
0.5 0.5
0.5
0.5
0.3
0.5
0.3
0.5
t = 1 2 3 4 …
In general, 1 2
1 2 1 2...
( | ) ( | ... ) ( ... )T
T TS S S
p O p O S S S p S S Sλ = ∑
All HMM parameters
24
The Forward Algorithm
Observation:
$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \sum_{S_1 \ldots S_{T-1}} p(o_1 \ldots o_T, S_1 \ldots S_{T-1}, S_T = s_i)$

Define $\alpha_t(i) = \sum_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i)$: the probability of generating $o_1 \ldots o_t$ with ending state $s_i$.

Since $p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i) = p(o_1 \ldots o_{t-1}, S_1 \ldots S_{t-1})\, p(S_t = s_i \mid S_{t-1})\, p(o_t \mid S_t = s_i)$, summing over paths gives the recursion.

Algorithm:
$\alpha_1(i) = \pi_i\, b_i(o_1)$
$\alpha_t(i) = b_i(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}$

The data likelihood is $p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.   Complexity: $O(TN^2)$
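Before the worked example on the next slide, here is the recursion as a short sketch (same assumed parameter dicts as the earlier sketches); only the current column of α values is kept:

def forward(obs, states, pi, A, B):
    # alpha[i] = p(o_1 ... o_t, S_t = s_i); start with alpha_1(i) = pi_i * b_i(o_1).
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        # alpha_t(i) = b_i(o_t) * sum_j alpha_{t-1}(j) * a_ji
        alpha = {i: B[i][o] * sum(alpha[j] * A[j][i] for j in states)
                 for i in states}
    return sum(alpha.values())    # p(o_1 ... o_T | lambda) = sum_i alpha_T(i)

print(forward('acgt', states, pi, A, B))   # 0.0034225 = alpha_4(B) + alpha_4(I)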
25
Forward Algorithm: Example
[Trellis: B and I unrolled over a c g t.]

α1(B): 0.5 * p(“a”|B)
α1(I): 0.5 * p(“a”|I)
α2(B): [α1(B)*0.7 + α1(I)*0.5] * p(“c”|B)   …
α2(I): [α1(B)*0.3 + α1(I)*0.5] * p(“c”|I)   …

Since $\alpha_t(i) = b_i(o_t) \sum_j \alpha_{t-1}(j)\, a_{ji}$ and $p(o_1 \ldots o_T \mid \lambda) = \sum_i \alpha_T(i)$:
P(“a c g t”) = α4(B) + α4(I)
26
The Backward Algorithm
Observation:
$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \sum_{S_2 \ldots S_T} p(o_1 \ldots o_T, S_1 = s_i, S_2 \ldots S_T) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, p(o_2 \ldots o_T \mid S_1 = s_i)$

Define $\beta_t(i) = \sum_{S_{t+1} \ldots S_T} p(o_{t+1} \ldots o_T, S_{t+1} \ldots S_T \mid S_t = s_i)$: starting from state $s_i$ (with $o_1 \ldots o_t$ already generated), the probability of generating $o_{t+1} \ldots o_T$.

Since $p(o_{t+1} \ldots o_T, S_{t+1} \ldots S_T \mid S_t = s_i) = p(S_{t+1} \mid S_t = s_i)\, p(o_{t+1} \mid S_{t+1})\, p(o_{t+2} \ldots o_T, S_{t+2} \ldots S_T \mid S_{t+1})$, summing over paths gives the recursion.

Algorithm:
$\beta_T(i) = 1$
$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$

The data likelihood is $p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$.   Complexity: $O(TN^2)$
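The mirror-image sketch (same assumed parameter dicts as above); it agrees with the forward value, which is a handy correctness check:

def backward(obs, states, pi, A, B):
    beta = {s: 1.0 for s in states}               # beta_T(i) = 1
    for o in reversed(obs[1:]):
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta = {i: sum(A[i][j] * B[j][o] * beta[j] for j in states)
                for i in states}
    # p(o_1 ... o_T | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[s] * B[s][obs[0]] * beta[s] for s in states)

print(backward('acgt', states, pi, A, B))   # 0.0034225, same as forward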
27
Backward Algorithm: Example
[Trellis: B and I unrolled over a c g t.]

β4(B): 1
β4(I): 1
β3(B): 0.7*p(“t”|B)*β4(B) + 0.3*p(“t”|I)*β4(I)
β3(I): 0.5*p(“t”|B)*β4(B) + 0.5*p(“t”|I)*β4(I)
…

Since $p(o_1 \ldots o_T \mid \lambda) = \sum_i \alpha_t(i)\, \beta_t(i)$ for any $t$:
P(“a c g t”) = α1(B)*β1(B) + α1(I)*β1(I) = α2(B)*β2(B) + α2(I)*β2(I)
28
Problem III: Training – Estimating Parameters
Where do we get the probability values for all parameters?
Supervised vs. Unsupervised
29
Supervised Training
Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
4. State transitions, e.g., S = 1121122222

Task: estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

π1 = 1/1 = 1; π2 = 0/1 = 0
a11 = 2/4 = 0.5; a12 = 2/4 = 0.5; a21 = 1/5 = 0.2; a22 = 4/5 = 0.8
b1(a) = 4/4 = 1.0; b1(b) = 0/4 = 0; b2(a) = 1/6 = 0.167; b2(b) = 5/6 = 0.833

[Diagram: the estimated two-state model — P(s1) = 1, P(s2) = 0; a11 = 0.5, a12 = 0.5, a21 = 0.2, a22 = 0.8; P(a|s1) = 1, P(b|s1) = 0; P(a|s2) = 0.167, P(b|s2) = 0.833.]
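Since the estimates are just relative frequencies, a small counting sketch (helper names are mine) reproduces the numbers above:

from collections import Counter, defaultdict

def supervised_estimate(O, S):
    # Relative-frequency estimates from one labeled (observations, states) pair.
    pi = Counter({S[0]: 1})
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for s, o in zip(S, O):
        emit[s][o] += 1                 # state s emitted symbol o
    for s, s_next in zip(S, S[1:]):
        trans[s][s_next] += 1           # transition s -> s_next
    norm = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return (norm(pi),
            {s: norm(c) for s, c in trans.items()},
            {s: norm(c) for s, c in emit.items()})

print(supervised_estimate('aaaaabbbbb', '1121122222'))
# pi_1 = 1.0; a11 = a12 = 0.5, a21 = 0.2, a22 = 0.8;
# b1(a) = 1.0, b2(a) = 0.1667, b2(b) = 0.8333 — matching the slide.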
30
Unsupervised Training
Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
(This time the state sequence S is NOT given.)

Task: estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

How could this be possible?
Maximum likelihood: $\lambda^* = \arg\max_\lambda p(O \mid \lambda)$
31
Intuition
Treat the state sequence as missing data: enumerate all possible paths q1, …, qK and weight the counts from each path by how well it explains the observations, P(O, qk | λ).

O = aaaaabbbbb
q1 = 1111111111 → P(O, q1|λ)
q2 = 1111112211 → P(O, q2|λ)
…
qK = 2222222222 → P(O, qK|λ)

New λ′:
$\pi_i = \frac{\sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(1) = i]}{\sum_{k=1}^{K} p(O, q_k \mid \lambda)}$

$a_{ij} = \frac{\sum_{t=1}^{T-1} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(t) = i,\ q_k(t+1) = j]}{\sum_{t=1}^{T-1} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(t) = i]}$

$b_i(v_j) = \frac{\sum_{t=1}^{T} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(t) = i,\ o_t = v_j]}{\sum_{t=1}^{T} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(t) = i]}$

But computing P(O, qk|λ) for every path is expensive — there are $N^T$ of them…
32
Baum-Welch Algorithm
Basic “counters”:
$\gamma_t(i) = p(q_t = s_i \mid O, \lambda)$ — being at state $s_i$ at time $t$
$\xi_t(i, j) = p(q_t = s_i, q_{t+1} = s_j \mid O, \lambda)$ — being at state $s_i$ at time $t$ and at state $s_j$ at time $t+1$

Computation of counters (from the forward and backward values):
$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$

$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$, with $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$

Complexity: $O(N^2)$ per time step.
33
Baum-Welch Algorithm (cont.)
Updating formulas:
$\pi_i' = \gamma_1(i)$

$a_{ij}' = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i, j')}$

$b_i'(v_k) = \frac{\sum_{t=1}^{T} \gamma_t(i)\, \delta[o_t = v_k]}{\sum_{t=1}^{T} \gamma_t(i)}$

Overall complexity for each iteration: $O(TN^2)$
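Putting the counters and updating formulas together, one re-estimation step can be sketched as follows (function and variable names are mine; the parameter-dict layout matches the earlier sketches). Iterating the step never decreases $p(O \mid \lambda)$:

def baum_welch_step(obs, states, pi, A, B):
    T = len(obs)
    # Full forward (al) and backward (be) tables, one dict per time step.
    al = [{s: pi[s] * B[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = al[-1]
        al.append({i: B[i][o] * sum(prev[j] * A[j][i] for j in states)
                   for i in states})
    be = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        nxt = be[0]
        be.insert(0, {i: sum(A[i][j] * B[j][o] * nxt[j] for j in states)
                      for i in states})
    pO = sum(al[-1][s] for s in states)            # data likelihood p(O | lambda)
    # Counters: gamma_t(i) and xi_t(i, j).
    gamma = [{i: al[t][i] * be[t][i] / pO for i in states} for t in range(T)]
    xi = [{(i, j): al[t][i] * A[i][j] * B[j][obs[t + 1]] * be[t + 1][j] / pO
           for i in states for j in states} for t in range(T - 1)]
    # Updating formulas.
    new_pi = {i: gamma[0][i] for i in states}
    new_A = {i: {j: sum(x[i, j] for x in xi) / sum(g[i] for g in gamma[:-1])
                 for j in states} for i in states}
    new_B = {i: {v: sum(g[i] for g, o in zip(gamma, obs) if o == v) /
                    sum(g[i] for g in gamma) for v in B[i]} for i in states}
    return new_pi, new_A, new_B, pO

# e.g., pi, A, B, pO = baum_welch_step('acgtt', states, pi, A, B)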
34
Next Time
• Tutorial
• Posted on Blackboard