Hidden Markov Models (HMMs)


Transcript of Hidden Markov Models (HMMs)

Page 1: Hidden Markov Models (HMMs)

1

Hidden Markov Models (HMMs)

(Lecture for CS410 Text Information Systems)

April 24, 2009

ChengXiang Zhai
Department of Computer Science

University of Illinois, Urbana-Champaign

Page 2: Hidden Markov Models (HMMs)

2

Modeling a Multi-Topic Document

…text mining passage

food nutrition passage

text mining passage

text mining passage

food nutrition passage …

A document with 2 subtopics, thus 2 types of vocabulary

How do we model such a document?

How do we “generate” such a document?

How do we estimate our model?

We’ve already seen one solution – unigram mixture model + EM.
This lecture is about another (better) solution – HMM + EM.

Page 3: Hidden Markov Models (HMMs)

3

Simple Unigram Mixture Model

Model/topic 1, p(w|θ1): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …

Model/topic 2, p(w|θ2): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …

Mixing weights: λ = 0.7, 1 - λ = 0.3

p(w|θ1, θ2) = λ p(w|θ1) + (1 - λ) p(w|θ2)
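To make the mixture formula concrete, here is a minimal Python sketch (an illustration added to this transcript, not code from the lecture) that evaluates p(w|θ1, θ2) = λ p(w|θ1) + (1 - λ) p(w|θ2) using the handful of word probabilities shown on this slide; words missing from a table are treated as probability 0 purely for illustration.

```python
# Minimal sketch of the two-topic unigram mixture from this slide.
# Only the few word probabilities shown on the slide are listed; a real model
# would define a full distribution over the whole vocabulary.

topic1 = {"text": 0.2, "mining": 0.1, "association": 0.01, "clustering": 0.02, "food": 0.00001}
topic2 = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}

lam = 0.7  # mixing weight for topic 1

def p_word(w, lam=lam):
    """p(w | theta1, theta2) = lam * p(w|theta1) + (1 - lam) * p(w|theta2)."""
    return lam * topic1.get(w, 0.0) + (1 - lam) * topic2.get(w, 0.0)

print(p_word("text"))   # 0.7 * 0.2  + 0.3 * 0    = 0.14
print(p_word("food"))   # 0.7 * 1e-5 + 0.3 * 0.25 = 0.075007
```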

Page 4: Hidden Markov Models (HMMs)

4

Deficiency of the Simple Mixture Model

• Adjacent words are sampled independently
• Cannot ensure passage cohesion
• Cannot capture the dependency between neighboring words

Example: "We apply the text mining algorithm to the nutrition data to find patterns …"
[Figure: the sentence is annotated with Topic 1 = text mining, p(w|θ1) (two spans) and Topic 2 = health, p(w|θ2)]

Solution = ?

Page 5: Hidden Markov Models (HMMs)

5

The Hidden Markov Model Solution

• Basic idea:
  – Make the choice of model for a word depend on the choice of model for the previous word
  – Thus we model both the choice of model ("state") and the words ("observations")

O: We apply the text mining algorithm to the nutrition data to find patterns …
S1: 1 1 1 1 1 1 1 1 1 1 1 1 1

O: We apply the text mining algorithm to the nutrition data to find patterns …
S2: 1 1 1 1 1 1 1 1 2 1 1 1 1

P(O, S1) > P(O, S2)?

Page 6: Hidden Markov Models (HMMs)

6

Another Example: Passage Retrieval

Relevant Passages (“T”)
Background text (“B”)

Strategy I: Build passage index
– Need pre-processing
– Query independent boundaries
– Efficient

Strategy II: Extract relevant passages dynamically
– No pre-processing
– Dynamically decide the boundaries
– Slow

O: We apply the text mining algorithm to the nutrition data to find patterns …
S1: B T B T T T B B T T B T T

O: … … We apply the text mining … patterns … …
S2: BB…BB T T T T T … T BB… BBBB

Page 7: Hidden Markov Models (HMMs)

7

A Simple HMM for Recognizing Relevant Passages

P(w|B): the 0.2, a 0.1, we 0.01, to 0.02, …, text 0.0001, mining 0.00005, …

P(w|T): text 0.125, mining 0.1, association 0.1, clustering 0.05, …

Parameters
Initial state prob: p(B) = 0.5; p(T) = 0.5
State transition prob: p(BB) = 0.8, p(BT) = 0.2, p(TB) = 0.5, p(TT) = 0.5
Output prob: P("the"|B) = 0.2, …; P("text"|T) = 0.1, …

[State diagram: states B (Background) and T (Topic); transitions B→B 0.8, B→T 0.2, T→B 0.5, T→T 0.5; initial P(B) = 0.5, P(T) = 0.5; output distributions P(w|B) and P(w|T)]

Page 8: Hidden Markov Models (HMMs)

8

A General Definition of HMM

An HMM is specified by (S, V, B, A, Π):

S = {s_1, …, s_N}: N states
V = {v_1, …, v_M}: M symbols

Initial state probability: Π = {π_i}, 1 ≤ i ≤ N, Σ_{i=1..N} π_i = 1
  π_i: probability of starting at state s_i

State transition probability: A = {a_ij}, 1 ≤ i, j ≤ N, Σ_{j=1..N} a_ij = 1
  a_ij: probability of going from s_i to s_j

Output probability: B = {b_i(v_k)}, 1 ≤ i ≤ N, 1 ≤ k ≤ M, Σ_{k=1..M} b_i(v_k) = 1
  b_i(v_k): probability of generating v_k at s_i
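As an illustration of this definition, here is one possible Python representation of the (S, V, B, A, Π) components, filled in with the background/topic example from the previous slide. The variable names (states, init, trans, emit) are this write-up's choice, not the lecture's; words missing from an emission table are simply treated as probability 0.

```python
# A sketch of the HMM components in Python, instantiated with the B/T example.

states = ["B", "T"]                       # S: N states
init   = {"B": 0.5, "T": 0.5}             # Pi: initial state probabilities pi_i
trans  = {"B": {"B": 0.8, "T": 0.2},      # A: a_ij = prob. of going from s_i to s_j
          "T": {"B": 0.5, "T": 0.5}}
emit   = {"B": {"the": 0.2, "a": 0.1, "we": 0.01, "to": 0.02,
                "text": 0.0001, "mining": 0.00005},
          "T": {"text": 0.125, "mining": 0.1,
                "association": 0.1, "clustering": 0.05}}  # B: b_i(v_k)
```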

Page 9: Hidden Markov Models (HMMs)

9

How to “Generate” Text?

[State diagram as before: states B and T; transitions B→B 0.8, B→T 0.2, T→B 0.5, T→T 0.5; initial P(B) = 0.5, P(T) = 0.5]

P(w|B): the 0.2, a 0.1, we 0.01, is 0.02, …, method 0.0001, mining 0.00005, …
P(w|T): text 0.125, mining 0.1, algorithm 0.1, clustering 0.05, …

[Figure: a sampled state sequence alternating between B and T, generating one word at each step (the, text, is, clustering, algorithm, a, mining, method in the figure), with the corresponding transition and output probabilities along the path]

P(BTT…, "the text mining…") = p(B) p("the"|B) * p(T|B) p("text"|T) * p(T|T) p("mining"|T) * …
                             = 0.5*0.2 * 0.2*0.125 * 0.5*0.1 * …
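The generation process sketched on this slide can be written in a few lines of Python. This is an illustrative sketch, not the lecture's code: the helper `sample` renormalizes over the few listed words, since the slide only shows a fragment of each output distribution.

```python
# Sketch of "generating" text from the two-state HMM: pick a start state from Pi,
# then alternately emit a word from the current state and transition to the next state.
import random

init  = {"B": 0.5, "T": 0.5}
trans = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
emit  = {"B": {"the": 0.2, "a": 0.1, "we": 0.01, "is": 0.02,
               "method": 0.0001, "mining": 0.00005},
         "T": {"text": 0.125, "mining": 0.1, "algorithm": 0.1, "clustering": 0.05}}

def sample(dist):
    """Draw a key from {outcome: prob}; weights are renormalized over the listed outcomes."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def generate(length):
    states, words = [], []
    s = sample(init)
    for _ in range(length):
        states.append(s)
        words.append(sample(emit[s]))   # emit a word from the current state
        s = sample(trans[s])            # move to the next state
    return states, words

print(generate(5))
```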

Page 10: Hidden Markov Models (HMMs)

10

HMM as a Probabilistic Model

Sequential data and random variables/process:
Time/Index:            t1 t2 t3 t4 …
Data:                  o1 o2 o3 o4 …
Observation variable:  O1 O2 O3 O4 …
Hidden state variable: S1 S2 S3 S4 …

Joint probability (complete likelihood):
p(O_1, O_2, …, O_T, S_1, S_2, …, S_T) = p(S_1) p(O_1|S_1) * p(S_2|S_1) p(O_2|S_2) * … * p(S_T|S_{T-1}) p(O_T|S_T)
(the factors are the init. state distr., the output prob., and the state trans. prob.)

State transition prob:
p(S_1, S_2, …, S_T) = p(S_1) p(S_2|S_1) … p(S_T|S_{T-1})

Probability of observations with known state transitions:
p(O_1, O_2, …, O_T | S_1, S_2, …, S_T) = p(O_1|S_1) p(O_2|S_2) … p(O_T|S_T)

Probability of observations (incomplete likelihood):
p(O_1, O_2, …, O_T) = Σ_{S_1, …, S_T} p(O_1, O_2, …, O_T, S_1, …, S_T)
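The complete likelihood factorization translates directly into code. A small sketch, assuming the B/T parameters from the earlier slides; it reproduces the 0.5*0.2 * 0.2*0.125 * 0.5*0.1 computation from slide 9.

```python
# Sketch of the complete likelihood p(O_1..O_T, S_1..S_T):
# p(S_1) p(O_1|S_1) * prod_t p(S_t|S_{t-1}) p(O_t|S_t), with the B/T example values.

init  = {"B": 0.5, "T": 0.5}
trans = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
emit  = {"B": {"the": 0.2}, "T": {"text": 0.125, "mining": 0.1}}  # only the values used below

def complete_likelihood(words, states):
    p = init[states[0]] * emit[states[0]].get(words[0], 0.0)
    for t in range(1, len(words)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]].get(words[t], 0.0)
    return p

# Matches the slide-9 computation: 0.5*0.2 * 0.2*0.125 * 0.5*0.1 = 0.000125
print(complete_likelihood(["the", "text", "mining"], ["B", "T", "T"]))
```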

Page 11: Hidden Markov Models (HMMs)

11

Three Problems

1. Decoding – finding the most likely path
   Given: model, parameters, observations (data)
   Find: the most likely state sequence (path)
   (S_1*, S_2*, …, S_T*) = argmax_{S_1…S_T} p(S_1 … S_T | O) = argmax_{S_1…S_T} p(S_1 … S_T, O)

2. Evaluation – computing the observation likelihood
   Given: model, parameters, observations (data)
   Find: the likelihood of generating the data
   p(O|λ) = Σ_{S_1…S_T} p(O | S_1 … S_T) p(S_1 … S_T)

Page 12: Hidden Markov Models (HMMs)

12

Three Problems (cont.)

3. Training – estimating parameters
   - Supervised
     Given: model structure, labeled data (data + state sequences)
     Find: parameter values
   - Unsupervised
     Given: model structure, data (unlabeled)
     Find: parameter values

   λ* = argmax_λ p(O|λ)

Page 13: Hidden Markov Models (HMMs)

13

Problem I: Decoding/Parsing

Finding the most likely path

This is the most common way of using an HMM (e.g., extraction, structure analysis)…

Page 14: Hidden Markov Models (HMMs)

14

What’s the most likely path?

[Same HMM as before: states B and T; transitions B→B 0.8, B→T 0.2, T→B 0.5, T→T 0.5; initial P(B) = 0.5, P(T) = 0.5]

P(w|B): the 0.2, a 0.1, we 0.01, is 0.02, …, method 0.0001, mining 0.00005, …
P(w|T): text 0.125, mining 0.1, algorithm 0.1, clustering 0.05, …

[Figure: the same word sequence as before, but now every state label is unknown ("? ? ? …")]

(S_1*, S_2*, …, S_T*) = argmax_{S_1…S_T} p(S_1 … S_T, O)
                      = argmax_{S_1…S_T} π_{S_1} b_{S_1}(o_1) Π_{i=2..T} a_{S_{i-1} S_i} b_{S_i}(o_i)

Page 15: Hidden Markov Models (HMMs)

15

Viterbi Algorithm: An Example

[Trellis: states B and T at each position of the observation "the text mining algorithm" (t = 1, 2, 3, 4); transitions B→B 0.8, B→T 0.2, T→B 0.5, T→T 0.5; initial P(B) = 0.5, P(T) = 0.5]

P(w|B): the 0.2, a 0.1, we 0.01, …, algorithm 0.0001, mining 0.005, text 0.001
P(w|T): the 0.05, text 0.125, mining 0.1, algorithm 0.1, clustering 0.05, …

t = 1, 2, 3, 4, …

VP(B): 0.5*0.2 (B)    0.5*0.2*0.8*0.001 (BB)    …*0.5*0.005 (BTB)    …*0.5*0.0001 (BTTB)
VP(T): 0.5*0.05 (T)   0.5*0.2*0.2*0.125 (BT)    …*0.5*0.1 (BTT)      …*0.5*0.1 (BTTT)  ← Winning path!

Page 16: Hidden Markov Models (HMMs)

16

Viterbi Algorithm

Observation:
max_{S_1…S_T} p(o_1 … o_T, S_1 … S_T) = max_{s_i} [ max_{S_1…S_{T-1}} p(o_1 … o_T, S_1 … S_{T-1}, S_T = s_i) ]

Algorithm:
Define
VP_t(i) = max_{S_1…S_{t-1}} p(o_1 … o_t, S_1 … S_{t-1}, S_t = s_i)
q_t*(i) = [ argmax_{S_1…S_{t-1}} p(o_1 … o_t, S_1 … S_{t-1}, S_t = s_i) ] followed by (i)   (the best partial path ending in s_i)

1. VP_1(i) = π_i b_i(o_1),  q_1*(i) = (i),  for i = 1, …, N
2. For 1 < t ≤ T:
   VP_t(i) = max_{1 ≤ j ≤ N} VP_{t-1}(j) a_ji b_i(o_t)
   q_t*(i) = q_{t-1}*(k) followed by (i), where k = argmax_{1 ≤ j ≤ N} VP_{t-1}(j) a_ji b_i(o_t),  for i = 1, …, N

(Dynamic programming)

The best path is q_T*(i), for the i maximizing VP_T(i).    Complexity: O(TN^2)
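A minimal Python sketch of the Viterbi recursion above (illustrative, not the lecture's implementation; the function name `viterbi` is this write-up's). Run on the slide-15 parameters it recovers the winning path B T T T for "the text mining algorithm".

```python
# Viterbi sketch: VP_1(i) = pi_i b_i(o_1), VP_t(i) = max_j VP_{t-1}(j) a_ji b_i(o_t),
# with backpointers to recover the best path.

def viterbi(obs, states, init, trans, emit):
    VP = [{s: init[s] * emit[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        VP.append({})
        back.append({})
        for i in states:
            # best previous state j for ending in state i at time t
            j_best = max(states, key=lambda j: VP[t - 1][j] * trans[j][i])
            VP[t][i] = VP[t - 1][j_best] * trans[j_best][i] * emit[i].get(obs[t], 0.0)
            back[t][i] = j_best
    # trace back from the best final state
    last = max(states, key=lambda s: VP[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), VP[-1][last]

init  = {"B": 0.5, "T": 0.5}
trans = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
emit  = {"B": {"the": 0.2, "algorithm": 0.0001, "mining": 0.005, "text": 0.001},
         "T": {"the": 0.05, "text": 0.125, "mining": 0.1, "algorithm": 0.1}}
print(viterbi(["the", "text", "mining", "algorithm"], ["B", "T"], init, trans, emit))
# -> (['B', 'T', 'T', 'T'], ...) the "winning path" from the example
```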

Page 17: Hidden Markov Models (HMMs)

17

Problem II: Evaluation
Computing the data likelihood

Another use of an HMM, e.g., as a generative model for classification
Also related to Problem III – parameter estimation

Page 18: Hidden Markov Models (HMMs)

18

Data Likelihood: p(O|λ)

p("the text …"|λ) = p("the text …"|BB…B) p(BB…B)
                  + p("the text …"|BT…B) p(BT…B)
                  + …
                  + p("the text …"|TT…T) p(TT…T)

[Trellis: states B and T over t = 1, 2, 3, 4, … for the observation "the text mining algorithm …", with transitions 0.8, 0.2, 0.5, 0.5 and initial probabilities 0.5, 0.5]

In general,
p(O|λ) = Σ_{S_1…S_T} p(O | S_1 … S_T) p(S_1 … S_T)

Complexity of a naïve approach? (enumerate all paths: BB…, BBT…, BTT…, …, TT…T)
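For comparison with the forward algorithm on the next slides, here is a brute-force sketch (an illustration added here, not lecture code) that literally enumerates all N^T state paths and sums p(O|path) p(path), using the same example parameters.

```python
# Naive evaluation: enumerate all N^T state paths and sum p(O|path) p(path).
# Exponential in T, which is exactly what the forward algorithm avoids.
from itertools import product

init  = {"B": 0.5, "T": 0.5}
trans = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
emit  = {"B": {"the": 0.2, "algorithm": 0.0001, "mining": 0.005, "text": 0.001},
         "T": {"the": 0.05, "text": 0.125, "mining": 0.1, "algorithm": 0.1}}

def naive_likelihood(obs, states):
    total = 0.0
    for path in product(states, repeat=len(obs)):          # all N^T paths
        p = init[path[0]] * emit[path[0]].get(obs[0], 0.0)
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]].get(obs[t], 0.0)
        total += p
    return total

print(naive_likelihood(["the", "text", "mining", "algorithm"], ["B", "T"]))
```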

Page 19: Hidden Markov Models (HMMs)

19

The Forward Algorithm

[Trellis: states B and T over time; α1 computed for the first position]

Page 20: Hidden Markov Models (HMMs)

20

The Forward Algorithm

[Trellis: states B and T over time; α1 and α2 computed for the first two positions]

Page 21: Hidden Markov Models (HMMs)

21

The Forward Algorithm

The data likelihood is
p(o_1 … o_T | λ) = Σ_{i=1..N} α_T(i)

Definition (α_t(i): generating o_1 … o_t with ending state s_i):
α_t(i) = Σ_{S_1…S_{t-1}} p(o_1 … o_t, S_1 … S_{t-1}, S_t = s_i)

so that
p(o_1 … o_T | λ) = Σ_{i=1..N} Σ_{S_1…S_{T-1}} p(o_1 … o_T, S_1 … S_{T-1}, S_T = s_i) = Σ_{i=1..N} α_T(i)

Recursion:
α_1(i) = π_i b_i(o_1)
α_{t+1}(i) = [ Σ_{j=1..N} α_t(j) a_ji ] b_i(o_{t+1})

Complexity: O(TN^2)

[Trellis: α1, α2, α3, α4 computed left to right over states B and T]

Page 22: Hidden Markov Models (HMMs)

22

Forward Algorithm: Example

[Trellis: states B and T over the observation "the text mining algorithm" (t = 1, 2, 3, 4); transitions B→B 0.8, B→T 0.2, T→B 0.5, T→T 0.5; initial P(B) = 0.5, P(T) = 0.5]

α1(B): 0.5 * p("the"|B)
α1(T): 0.5 * p("the"|T)

α2(B): [α1(B)*0.8 + α1(T)*0.5] * p("text"|B)   ……
α2(T): [α1(B)*0.2 + α1(T)*0.5] * p("text"|T)   ……

p(o_1 … o_T | λ) = Σ_{i=1..N} α_T(i),   α_t(i) = [ Σ_{j=1..N} α_{t-1}(j) a_ji ] b_i(o_t)

P("the text mining algorithm") = α4(B) + α4(T)
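The forward recursion translates directly into a short Python sketch (illustrative, with the parameter values taken from the example above).

```python
# Forward algorithm: alpha_1(i) = pi_i b_i(o_1),
# alpha_t(i) = [sum_j alpha_{t-1}(j) a_ji] * b_i(o_t), and p(O) = sum_i alpha_T(i).

def forward(obs, states, init, trans, emit):
    alphas = [{s: init[s] * emit[s].get(obs[0], 0.0) for s in states}]
    for t in range(1, len(obs)):
        prev = alphas[-1]
        alphas.append({
            i: sum(prev[j] * trans[j][i] for j in states) * emit[i].get(obs[t], 0.0)
            for i in states
        })
    return alphas, sum(alphas[-1].values())   # all alpha_t, and p(o_1..o_T)

init  = {"B": 0.5, "T": 0.5}
trans = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
emit  = {"B": {"the": 0.2, "algorithm": 0.0001, "mining": 0.005, "text": 0.001},
         "T": {"the": 0.05, "text": 0.125, "mining": 0.1, "algorithm": 0.1}}
alphas, likelihood = forward(["the", "text", "mining", "algorithm"], ["B", "T"], init, trans, emit)
print(likelihood)   # P("the text mining algorithm") = alpha_4(B) + alpha_4(T)
```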

Page 23: Hidden Markov Models (HMMs)

23

The Backward Algorithm

Observation:
p(o_1 … o_T | λ) = Σ_{i=1..N} Σ_{S_2…S_T} p(o_1 … o_T, S_1 = s_i, S_2 … S_T) = Σ_{i=1..N} π_i b_i(o_1) β_1(i)

Definition (β_t(i): starting from state s_i, generating o_{t+1} … o_T, with o_1 … o_t already generated):
β_t(i) = Σ_{S_{t+1}…S_T} p(o_{t+1} … o_T, S_{t+1} … S_T | S_t = s_i)

Algorithm:
β_T(i) = 1
β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j)

The data likelihood is
p(o_1 … o_T | λ) = Σ_{i=1..N} π_i b_i(o_1) β_1(i) = Σ_{i=1..N} α_t(i) β_t(i)   for any t

Complexity: O(TN^2)

Page 24: Hidden Markov Models (HMMs)

24

Backward Algorithm: Example

[Trellis: states B and T over the observation "the text mining algorithm" (t = 1, 2, 3, 4), with the same transition and initial probabilities as before]

β4(B) = 1    β3(B) = 0.8 * p("alg"|B) * β4(B) + 0.2 * p("alg"|T) * β4(T)
β4(T) = 1    β3(T) = 0.5 * p("alg"|B) * β4(B) + 0.5 * p("alg"|T) * β4(T)
……

P("the text mining algorithm") = α1(B)*β1(B) + α1(T)*β1(T) = α2(B)*β2(B) + α2(T)*β2(T)

p(o_1 … o_T | λ) = Σ_{i=1..N} π_i b_i(o_1) β_1(i) = Σ_{i=1..N} α_t(i) β_t(i)   for any t
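A matching backward-pass sketch (again illustrative, with the same example parameters); the final print recovers the same likelihood as the forward algorithm via Σ_i π_i b_i(o_1) β_1(i).

```python
# Backward algorithm: beta_T(i) = 1 and
# beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j).

def backward(obs, states, trans, emit):
    betas = [{s: 1.0 for s in states}]                      # beta_T(i) = 1
    for t in range(len(obs) - 2, -1, -1):
        nxt = betas[0]
        betas.insert(0, {
            i: sum(trans[i][j] * emit[j].get(obs[t + 1], 0.0) * nxt[j] for j in states)
            for i in states
        })
    return betas

init  = {"B": 0.5, "T": 0.5}
trans = {"B": {"B": 0.8, "T": 0.2}, "T": {"B": 0.5, "T": 0.5}}
emit  = {"B": {"the": 0.2, "algorithm": 0.0001, "mining": 0.005, "text": 0.001},
         "T": {"the": 0.05, "text": 0.125, "mining": 0.1, "algorithm": 0.1}}
obs = ["the", "text", "mining", "algorithm"]
betas = backward(obs, ["B", "T"], trans, emit)
# Likelihood via the backward variables: sum_i pi_i b_i(o_1) beta_1(i)
print(sum(init[i] * emit[i].get(obs[0], 0.0) * betas[0][i] for i in ["B", "T"]))
```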

Page 25: Hidden Markov Models (HMMs)

25

Problem III: Training
Estimating Parameters

Where do we get the probability values for all parameters?
Supervised vs. Unsupervised

Page 26: Hidden Markov Models (HMMs)

26

Supervised Training

Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
4. State transitions, e.g., S = 1121122222

Task: Estimate the following parameters
1. π1, π2
2. a11, a12, a22, a21
3. b1(a), b1(b), b2(a), b2(b)

π1 = 1/1 = 1;  π2 = 0/1 = 0
a11 = 2/4 = 0.5;  a12 = 2/4 = 0.5;  a21 = 1/5 = 0.2;  a22 = 4/5 = 0.8
b1(a) = 4/4 = 1.0;  b1(b) = 0/4 = 0;  b2(a) = 1/6 = 0.167;  b2(b) = 5/6 = 0.833

[State diagram of the estimated model: P(s1) = 1, P(s2) = 0; transitions a11 = 0.5, a12 = 0.5, a21 = 0.2, a22 = 0.8; P(a|s1) = 1, P(b|s1) = 0, P(a|s2) = 0.167, P(b|s2) = 0.833]
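The counting on this slide amounts to a few lines of Python. A sketch (not lecture code) using the same O and S:

```python
# Supervised estimation: count starts, transitions, and emissions
# from the labeled sequence, then normalize.
from collections import Counter

O = "aaaaabbbbb"
S = "1121122222"

start = Counter(S[0])                       # who starts (one training sequence)
trans = Counter(zip(S, S[1:]))              # (s_i, s_j) transition pairs
emit  = Counter(zip(S, O))                  # (state, symbol) emission pairs
out_of = Counter(S[:-1])                    # transitions out of each state
total  = Counter(S)                         # emissions from each state

pi = {s: start[s] / 1 for s in "12"}
a  = {(i, j): trans[(i, j)] / out_of[i] for i in "12" for j in "12"}
b  = {(i, v): emit[(i, v)] / total[i] for i in "12" for v in "ab"}

print(pi)   # {'1': 1.0, '2': 0.0}
print(a)    # a11=0.5, a12=0.5, a21=0.2, a22=0.8
print(b)    # b1(a)=1.0, b1(b)=0.0, b2(a)=0.167, b2(b)=0.833
```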

Page 27: Hidden Markov Models (HMMs)

27

Unsupervised Training

Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
(no state sequence this time)

Task: Estimate the following parameters
1. π1, π2
2. a11, a12, a22, a21
3. b1(a), b1(b), b2(a), b2(b)

How could this be possible?

Maximum Likelihood: λ* = argmax_λ p(O|λ)

Page 28: Hidden Markov Models (HMMs)

28

Intuition

O = aaaaabbbbb; consider every possible state sequence q_k and its probability P(O, q_k|λ):
q1 = 1111111111   P(O, q1|λ)
q2 = 11111112211  P(O, q2|λ)
…
qK = 2222222222   P(O, qK|λ)

New λ':

π_i = Σ_{k=1..K} p(O, q_k|λ) [q_k(1) = i]  /  Σ_{k=1..K} p(O, q_k|λ)

a_ij = Σ_{k=1..K} p(O, q_k|λ) Σ_{t=1..T-1} [q_k(t) = i, q_k(t+1) = j]  /  Σ_{k=1..K} p(O, q_k|λ) Σ_{t=1..T-1} [q_k(t) = i]

b_i(v_j) = Σ_{k=1..K} p(O, q_k|λ) Σ_{t=1..T} [q_k(t) = i, o_t = v_j]  /  Σ_{k=1..K} p(O, q_k|λ) Σ_{t=1..T} [q_k(t) = i]

Computation of P(O, q_k|λ) is expensive …

Page 29: Hidden Markov Models (HMMs)

29

Baum-Welch Algorithm

Basic "counters":
γ_t(i) = p(q(t) = s_i | O, λ)                    (being at state s_i at time t)
ξ_t(i, j) = p(q(t) = s_i, q(t+1) = s_j | O, λ)   (being at state s_i at time t and at state s_j at time t+1)

Computation of counters:
γ_t(i) = α_t(i) β_t(i)  /  Σ_{j=1..N} α_t(j) β_t(j)
ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)  /  Σ_{j=1..N} α_t(j) β_t(j)

Complexity: O(N^2)

Page 30: Hidden Markov Models (HMMs)

30

Baum-Welch Algorithm (cont.)

Updating formulas:

π_i' = γ_1(i)

a_ij' = Σ_{t=1..T-1} ξ_t(i, j)  /  Σ_{t=1..T-1} Σ_{j'=1..N} ξ_t(i, j')

b_i'(v_k) = Σ_{t=1..T} γ_t(i) [o_t = v_k]  /  Σ_{t=1..T} γ_t(i)

Overall complexity for each iteration: O(TN^2)
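Putting slides 29 and 30 together, one Baum-Welch iteration can be sketched as follows. This is an illustrative implementation under assumed toy starting parameters, not the lecture's code; it computes γ and ξ from the forward/backward variables and then applies the updating formulas above.

```python
# One Baum-Welch iteration on a toy two-state model over the alphabet {a, b}.

def forward(obs, states, pi, a, b):
    al = [{i: pi[i] * b[i][obs[0]] for i in states}]
    for t in range(1, len(obs)):
        al.append({i: sum(al[-1][j] * a[j][i] for j in states) * b[i][obs[t]] for i in states})
    return al

def backward(obs, states, a, b):
    be = [{i: 1.0 for i in states}]
    for t in range(len(obs) - 2, -1, -1):
        be.insert(0, {i: sum(a[i][j] * b[j][obs[t + 1]] * be[0][j] for j in states) for i in states})
    return be

def baum_welch_step(obs, states, symbols, pi, a, b):
    al, be = forward(obs, states, pi, a, b), backward(obs, states, a, b)
    T = len(obs)
    pO = sum(al[-1][i] for i in states)          # p(O|lambda)
    # "Counters": gamma_t(i) and xi_t(i,j)
    gamma = [{i: al[t][i] * be[t][i] / pO for i in states} for t in range(T)]
    xi = [{(i, j): al[t][i] * a[i][j] * b[j][obs[t + 1]] * be[t + 1][j] / pO
           for i in states for j in states} for t in range(T - 1)]
    # Updating formulas
    new_pi = {i: gamma[0][i] for i in states}
    new_a = {i: {j: sum(xi[t][(i, j)] for t in range(T - 1)) /
                    sum(xi[t][(i, jp)] for t in range(T - 1) for jp in states)
                 for j in states} for i in states}
    new_b = {i: {v: sum(gamma[t][i] for t in range(T) if obs[t] == v) /
                    sum(gamma[t][i] for t in range(T))
                 for v in symbols} for i in states}
    return new_pi, new_a, new_b

obs = list("aaaaabbbbb")
pi = {"1": 0.6, "2": 0.4}                                     # assumed starting guesses
a  = {"1": {"1": 0.7, "2": 0.3}, "2": {"1": 0.4, "2": 0.6}}
b  = {"1": {"a": 0.6, "b": 0.4}, "2": {"a": 0.3, "b": 0.7}}
print(baum_welch_step(obs, ["1", "2"], "ab", pi, a, b))
```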

Page 31: Hidden Markov Models (HMMs)

31

An HMM for Information Extraction (Research Paper Headers)

Page 32: Hidden Markov Models (HMMs)

32

What You Should Know

• Definition of an HMM
• What are the three problems associated with an HMM?
• Know how the following algorithms work
  – Viterbi algorithm
  – Forward & Backward algorithms
• Know the basic idea of the Baum-Welch algorithm

Page 33: Hidden Markov Models (HMMs)

33

Readings

• Read [Rabiner 89] sections I, II, III
• Read the "brief note"