
Transcript of Shlomo Moran, following Dan Geiger and Nir Friedman

Page 1:

Lecture #8:

- Parameter Estimation for HMM with Hidden States: the Baum-Welch Training
- Viterbi Training

- Extensions of HMM

Background Readings: Chapters 3.3, 11.2 in the textbook, Biological Sequence Analysis, Durbin et al., 2001.


Page 2:

Parameter Estimation in HMM: What we saw until now

Page 3:

The Parameters

An HMM model is defined by the probability parameters mkl and ek(b), for all states k, l and all symbols b. θ denotes the collection of these parameters.

To determine the values of the parameters θ, we use a training set {x1,...,xn}, where each xj is a sequence which is assumed to fit the model.

Page 4:

Maximum Likelihood Parameter Estimation for HMM

The ML method looks for the θ which maximizes the likelihood of the training set:

p(x1,..., xn|θ) = ∏j p(xj|θ).

Page 5:

Finding ML parameters for HMM when all states are known:

Let Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set. We look for parameters θ = {mkl, ek(b)} that:

Maximize

$$\prod_{(k,l)} m_{kl}^{M_{kl}} \prod_{(k,b)} e_k(b)^{E_k(b)}$$

subject to: for all states k, $\sum_{l} m_{kl} = 1$ and $\sum_{b} e_k(b) = 1$.

The optimal ML parameters θ are given by:

$$m_{kl} = \frac{M_{kl}}{\sum_{l'} M_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$
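To make the counting concrete, here is a minimal Python sketch of this estimator. The dict-of-dicts parameter layout and the (state path, sequence) pair format are assumptions of the sketch; pseudocounts, which a practical implementation would add to avoid zero probabilities, are omitted.

```python
from collections import defaultdict

def ml_estimate(training_set):
    """ML parameters from (state_path, sequence) pairs with known states.

    Counts transitions and emissions, then normalizes each row:
    m_kl = M_kl / sum_l' M_kl'   and   e_k(b) = E_k(b) / sum_b' E_k(b').
    """
    M = defaultdict(lambda: defaultdict(float))  # M[k][l]: # of k -> l transitions
    E = defaultdict(lambda: defaultdict(float))  # E[k][b]: # of emissions of b from k
    for s, x in training_set:
        for i in range(len(s)):
            E[s[i]][x[i]] += 1
            if i > 0:
                M[s[i - 1]][s[i]] += 1
    m = {k: {l: c / sum(row.values()) for l, c in row.items()} for k, row in M.items()}
    e = {k: {b: c / sum(row.values()) for b, c in row.items()} for k, row in E.items()}
    return m, e
```

For a toy example, ml_estimate([("FFLL", "1266")]) estimates a two-state fair/loaded model from one labeled sequence.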

Page 6:

Case 2: State paths are unknown

We need to maximize p(x|θ) = ∑s p(x,s|θ), where the summation is over all state paths s which produce the output sequence x. Finding the θ* which maximizes ∑s p(x,s|θ) is hard. [Unlike finding the θ* which maximizes p(x,s|θ) for a single pair (x,s).]

[Figure: HMM chain s1 → s2 → … → sL-1 → sL, where each hidden state si emits the symbol xi]

Page 7:

Parameter Estimation when States are Unknown

The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ* so that p(x|θ*) > p(x|θ).
3. Set θ = θ*.
4. Repeat until some convergence criterion is met.

A general algorithm of this type is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.

Page 8:

Baum-Welch Training

We start with some values of mkl and ek(b), which define prior values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* s.t.

p(x|θ*) > p(x|θ)

This is done by “imitating” the algorithm for Case 1, where all states are known:


Page 9:

When states were known, we counted…

In Case 1 we computed the optimal values of mkl and ek(b) (for the optimal θ) by simply counting the number Mkl of transitions from state k to state l, and the number Ek(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.

[Figure: a transition si-1 = k → si = l, with emissions xi-1 = b and xi = c]

Page 10:

When states are unknown, Mkl and Ek(b) are taken as averages:

Mkl and Ek(b) are computed according to the current distribution θ, that is:


$$M_{kl} = \sum_{s} M^{s}_{kl}\; p(s \mid x, \theta),$$

where $M^{s}_{kl}$ is the number of k to l transitions in the state path s.

Similarly,

$$E_k(b) = \sum_{s} E^{s}_{k}(b)\; p(s \mid x, \theta),$$

where $E^{s}_{k}(b)$ is the number of times k emits b in the path s with output x.

Page 11:

Computing averages of state-transitions:

Since the number of state paths s is exponential in L, it is too expensive to compute $M_{kl} = \sum_s M^s_{kl}\, p(s \mid x,\theta)$ in the naïve way. Hence, we use dynamic programming: for each pair (k,l) and for each edge si-1 → si, we compute the average number of “k to l” transitions over this edge. Then we take Mkl to be the sum over all edges.

[Figure: an edge si-1 → si with unknown states, emitting xi-1 = b and xi = c]

Page 12:

…and of Letter-Emissions

Similarly, for each emission edge si → xi with xi = b and each state k, we compute the average number of times that si = k, which is the expected number of “k → b” emissions on this edge. Then we take Ek(b) to be the sum over all such edges. These expected values are computed assuming the current parameters θ:

[Figure: an emission edge si = ? → xi = b]

Page 13:

Baum-Welch: Step E for Mkl

Count average number of state transitions

For computing the averages, Baum-Welch computes, for each index i and states k, l, the following probability:


P(si-1=k, si=l | x,θ)

For this, it uses the forward and backward algorithms.

Page 14:

Baum-Welch: Step E for Mkl

Claim: By the probability distribution of the HMM,


$$p(s_{i-1}=k,\, s_i=l \mid x, \theta) = \frac{F_k(i-1)\; m_{kl}\; e_l(x_i)\; B_l(i)}{p(x \mid \theta)}$$

(mkl and el(xi) are the parameters defined by θ, and Fk(i-1), Bl(i) are the outputs of the forward and backward algorithms.)

Page 15:

Proof of claim

$$P(x_1,\dots,x_L,\, s_{i-1}=k,\, s_i=l \mid \theta) = P(x_1,\dots,x_{i-1},\, s_{i-1}=k \mid \theta)\; m_{kl}\, e_l(x_i)\; P(x_{i+1},\dots,x_L \mid s_i=l,\, \theta)$$

$$= F_k(i-1)\; m_{kl}\, e_l(x_i)\; B_l(i)$$

(The first factor is computed via the forward algorithm, the last via the backward algorithm.)


Dividing by $p(x \mid \theta)$ gives:

$$p(s_{i-1}=k,\, s_i=l \mid x, \theta) = \frac{F_k(i-1)\; m_{kl}\; e_l(x_i)\; B_l(i)}{p(x \mid \theta)}$$

Page 16:

Step E for Mkl (end)

For each pair (k,l), compute the expected number of state transitions from k to l as the sum of the expected number of k to l transitions over all L edges:

$$M_{kl} = \sum_{i=1}^{L} p(s_{i-1}=k,\, s_i=l \mid x, \theta) = \frac{1}{p(x \mid \theta)} \sum_{i=1}^{L} F_k(i-1)\; m_{kl}\; e_l(x_i)\; B_l(i)$$
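As a concrete illustration, here is a hedged Python sketch of this E-step computation, with the forward and backward recursions written out. The explicit `start` distribution (standing in for the begin state, so the sum runs over the L-1 interior edges in 0-based indexing) and the dense dict-of-dicts parameters are assumptions of the sketch; a real implementation would work in log space or with scaling to avoid underflow.

```python
def forward(x, states, m, e, start):
    """F[i][k] = p(x[0..i], s_i = k); `start` is an assumed initial distribution."""
    F = [{k: start[k] * e[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        F.append({l: e[l][x[i]] * sum(F[i - 1][k] * m[k][l] for k in states)
                  for l in states})
    return F

def backward(x, states, m, e):
    """B[i][k] = p(x[i+1..L-1] | s_i = k)."""
    L = len(x)
    B = [{} for _ in range(L)]
    B[L - 1] = {k: 1.0 for k in states}
    for i in range(L - 2, -1, -1):
        B[i] = {k: sum(m[k][l] * e[l][x[i + 1]] * B[i + 1][l] for l in states)
                for k in states}
    return B

def expected_transitions(x, states, m, e, start):
    """E step for M_kl: M_kl = (1/p(x)) * sum_i F_k(i-1) m_kl e_l(x_i) B_l(i)."""
    F, B = forward(x, states, m, e, start), backward(x, states, m, e)
    px = sum(F[-1][k] for k in states)          # p(x | theta)
    M = {k: {l: sum(F[i - 1][k] * m[k][l] * e[l][x[i]] * B[i][l]
                    for i in range(1, len(x))) / px
             for l in states}
         for k in states}
    return M, F, B, px
```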

Page 17:

Step E for Mkl, with many sequences:

Claim: When we have n independent input sequences (x1,..., xn) of lengths L1,…,Ln, then Mkl is given by:

$$M_{kl} = \sum_{j=1}^{n} \frac{1}{p(x^j \mid \theta)} \sum_{i=1}^{L_j} p(s_{i-1}=k,\, s_i=l,\, x^j \mid \theta) = \sum_{j=1}^{n} \frac{1}{p(x^j \mid \theta)} \sum_{i=1}^{L_j} F^j_k(i-1)\; m_{kl}\; e_l(x^j_i)\; B^j_l(i)$$

where $F^j_k$ and $B^j_l$ are the forward and backward algorithms for $x^j$ under θ.
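Assuming the `expected_transitions` helper sketched above, the multi-sequence counts are simply the sum of the per-sequence contributions, each computed with its own forward/backward run:

```python
def expected_transitions_many(xs, states, m, e, start):
    """M_kl summed over sequences: each x^j contributes M^j_kl with its own p(x^j)."""
    total = {k: {l: 0.0 for l in states} for k in states}
    for x in xs:
        Mj, _, _, _ = expected_transitions(x, states, m, e, start)
        for k in states:
            for l in states:
                total[k][l] += Mj[k][l]
    return total
```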

Page 18:

Proof of Claim: When we have n independent input sequences (x1,..., xn), the probability space is the product of n spaces:

$$\{\, ((s^1,x^1),\, (s^2,x^2),\, \dots,\, (s^n,x^n)) \,\},$$

where for j = 1,…,n, $(s^j, x^j)$ is an HMM of length $L_j$, and $s^j$ ranges over all possible state paths of length $L_j$.

The probability of a simple event in this space with parameters θ is given by:

$$p\big((s^1,x^1),\, (s^2,x^2),\, \dots,\, (s^n,x^n) \mid \theta\big) = \prod_{j=1}^{n} p\big((s^j,x^j) \mid \theta\big)$$

Page 19:

Proof of Claim (cont):

The probability of that simple event given x = (x1,..,xn):

$$p\big((s^1,x^1),\, \dots,\, (s^n,x^n) \mid x, \theta\big) = \prod_{j=1}^{n} \frac{p\big((s^j,x^j) \mid \theta\big)}{p(x^j \mid \theta)}$$

The probability of the compound event $(s^j,x^j)$ given x = (x1,..,xn):

$$p\big((s^j,x^j) \mid x, \theta\big) = \frac{p\big((s^j,x^j) \mid \theta\big)}{p(x^j \mid \theta)}$$

Page 20:

Proof of Claim (end):

By the same argument used for a single sequence, we get that $M^j_{kl}$, the contribution of $x^j$ to $M_{kl}$, is given by

$$M^j_{kl} = \frac{1}{p(x^j \mid \theta)} \sum_{i=1}^{L_j} p(s_{i-1}=k,\, s_i=l,\, x^j \mid \theta),$$

and the claim follows by taking the sum over all j.

Page 21:

Baum-Welch: Step E for Ek(b): count expected number of letter-emissions

For each state k and each symbol b, and for each i where xi = b, compute the expected number of times that si = k.


$$p(s_i = k \mid x_1,\dots,x_L) = \frac{p(x_1,\dots,x_L,\, s_i = k)}{p(x_1,\dots,x_L)} = \frac{F_k(i)\, B_k(i)}{p(x_1,\dots,x_L)}$$

Page 22:

Baum-Welch: Step E for Ek(b)

For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that si = k, over all i’s for which xi = b.

$$E_k(b) = \frac{1}{p(x \mid \theta)} \sum_{i:\; x_i = b} F_k(i)\, B_k(i)$$
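A matching sketch for the emission counts, reusing the `forward` and `backward` helpers from the transition-count sketch (same assumptions apply):

```python
def expected_emissions(x, states, m, e, start):
    """E step for E_k(b): E_k(b) = (1/p(x)) * sum over {i : x_i = b} of F_k(i) B_k(i)."""
    F, B = forward(x, states, m, e, start), backward(x, states, m, e)
    px = sum(F[-1][k] for k in states)
    E = {k: {} for k in states}
    for i, b in enumerate(x):
        for k in states:
            E[k][b] = E[k].get(b, 0.0) + F[i][k] * B[i][k] / px
    return E
```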

Page 23:

Step E for Ek(b), many sequences

Exercise: when we have n sequences (x1,..., xn), the expected number of emissions of b from k is given by:

$$E_k(b) = \sum_{j=1}^{n} \frac{1}{p(x^j \mid \theta)} \sum_{i:\; x^j_i = b} F^j_k(i)\, B^j_k(i)$$

Page 24:

Summary: the E part of the Baum-Welch training

This part computes the expected numbers Mkl of k→l transitions for all pairs of states k and l, and the expected numbers Ek(b) of emissions of symbol b from state k, for all states k and symbols b.

The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.

Page 25:

Baum-Welch: step M

$$m_{kl} = \frac{M_{kl}}{\sum_{l'} M_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

Use the Mkl’s, Ek(b)’s to compute the new values of mkl and ek(b). These values define θ*.

The correctness of the EM algorithm implies that, if θ* ≠ θ, then:

p(x1,..., xn|θ*) > p(x1,..., xn|θ)

i.e., θ* increases the probability of the data, unless it is equal to θ. (This will follow from the correctness of the EM algorithm, to be proved later.)

This procedure is iterated, until some convergence criterion is met.
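Putting the E and M steps together, here is a sketch of the full training loop built from the helpers above. The log-likelihood-based stopping rule, and the assumption that every parameter row keeps nonzero expected counts (no pseudocounts), are simplifications of this sketch.

```python
import math

def m_step(M, E):
    """Normalize expected counts into new parameters (identical to Case 1)."""
    m = {k: {l: M[k][l] / sum(M[k].values()) for l in M[k]} for k in M}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in E[k]} for k in E}
    return m, e

def baum_welch(xs, states, m, e, start, tol=1e-6, max_iter=100):
    def loglik():
        return sum(math.log(sum(forward(x, states, m, e, start)[-1][k]
                                for k in states)) for x in xs)
    prev = loglik()
    for _ in range(max_iter):
        M = expected_transitions_many(xs, states, m, e, start)   # E step
        E = {k: {} for k in states}
        for x in xs:
            for k, row in expected_emissions(x, states, m, e, start).items():
                for b, c in row.items():
                    E[k][b] = E[k].get(b, 0.0) + c
        m, e = m_step(M, E)                                      # M step
        cur = loglik()
        if cur - prev < tol:                                     # convergence criterion
            break
        prev = cur
    return m, e
```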

Page 26:

Viterbi training: Maximizing the probability of the most probable path

Page 27:

Assume that rather than finding the θ which maximizes the likelihood of the input x1,..,xn, we wish to maximize the probability of a most probable path, i.e. to find parameters θ and state paths s(x1),..,s(xn) s.t. the value of

p(s(x1),..,s(xn), x1,..,xn | θ)

is maximized. Clearly, s(xj) should be the most probable path for xj under the parameters θ. We assume only one sequence (n=1).

This is done by Viterbi training.

Page 28:

Viterbi training

Start from given values of mkl and ek(b), which define prior values of θ. Then in each iteration:

Step 1: Use Viterbi's algorithm to find a most probable path s(x), which maximizes p(s(x), x|θ).


Page 29:

Viterbi training (cont)

Step 2: Use the ML method for HMM when the states are known to find θ* which maximizes p(s(x), x|θ*).

Note: If after Step 2 we have p(s(x), x|θ*) = p(s(x), x|θ), then it must be that θ = θ*. In this case the next iteration will be identical to the current one, and hence we may terminate the algorithm.


Page 30:

Viterbi training (cont)

Step 3: If θ ≠ θ*, set θ ← θ* and repeat. If θ = θ*, stop.

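Here is a sketch of the whole procedure in Python, with a compact Viterbi implementation, reusing the counting estimator `ml_estimate` from Case 1. The `start` distribution and the zero-probability defaults via `.get` are assumptions; a real implementation would work in log space and add pseudocounts so that no parameter row dies out.

```python
def viterbi(x, states, m, e, start):
    """A most probable state path for x under (m, e)."""
    tp = lambda k, l: m.get(k, {}).get(l, 0.0)   # transition prob, 0 if absent
    ep = lambda k, b: e.get(k, {}).get(b, 0.0)   # emission prob, 0 if absent
    V = [{k: start[k] * ep(k, x[0]) for k in states}]
    ptr = []
    for i in range(1, len(x)):
        col, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: V[-1][k] * tp(k, l))
            back[l] = best
            col[l] = V[-1][best] * tp(best, l) * ep(l, x[i])
        V.append(col)
        ptr.append(back)
    path = [max(states, key=lambda k: V[-1][k])]
    for back in reversed(ptr):              # trace the pointers backwards
        path.append(back[path[-1]])
    return path[::-1]

def viterbi_training(xs, states, m, e, start, max_iter=100):
    for _ in range(max_iter):
        paths = [viterbi(x, states, m, e, start) for x in xs]    # Step 1
        m_new, e_new = ml_estimate(list(zip(paths, xs)))         # Step 2
        if m_new == m and e_new == e:                            # theta* = theta: stop
            break
        m, e = m_new, e_new                                      # Step 3
    return m, e
```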

Page 31:

Extensions of HMM

Page 32:

1. Monitoring probabilities of repetitions

Markov chains are rather limited in describing sequences of symbols with non-random structures. For instance, a Markov chain forces the distribution of segments in which some state is repeated k+1 times to be $(1-p)p^k$, for some p.

[Figure: a run A A A A of a repeated state]

By adding states we may bypass this restriction:

Page 33:

1. State duplications

An extension of Markov chains which allows the distribution of segments in which a state is repeated k+1 times to have any desired value: assign k+1 states to represent the same “real” state. This may model k repetitions (or less) with any desired probability, as the sketch below illustrates.

[Figure: duplicated states A1 → A2 → A3 → A4]
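To see why k+1 duplicated states suffice, note that if state Ai "advances" to Ai+1 with probability ai (and otherwise leaves the block), then a run of length i has probability $(1-a_i)\prod_{j<i} a_j$, and the ai can be solved from any target run-length distribution. A small sketch under that assumed topology:

```python
def advance_probs(q):
    """Advance probabilities realizing the target distribution
    q[i-1] = P(run length = i) over run lengths 1..len(q).
    Assumes every tail sum used as a divisor is nonzero."""
    tail = [sum(q[i:]) for i in range(len(q))] + [0.0]   # tail[i] = P(run > i)
    return [tail[i + 1] / tail[i] for i in range(len(q))]

# Check: the chain A1 -> A2 -> A3 -> A4 reproduces q exactly.
q = [0.1, 0.2, 0.3, 0.4]
a = advance_probs(q)
reach, realized = 1.0, []
for ai in a:
    realized.append(reach * (1 - ai))   # leave the block after this state
    reach *= ai                         # advance to the next duplicate
print(realized)                          # ~[0.1, 0.2, 0.3, 0.4]
```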

Page 34:

2. Silent states

States which do not emit symbols. Can be used to model repetitions. Also used to allow arbitrary jumps (may be used to model deletions). We need to generalize the Forward and Backward algorithms to arbitrary acyclic digraphs to account for the silent states:

[Figure: silent states in one layer, regular states in another]

Page 35:

E.g., the forward algorithm should look as follows.

Directed cycles of silent (or other) states complicate things, and should be avoided.

For a regular vertex v of states which emit symbol $x_v$:

$$F_l(v) = e_l(x_v) \sum_{u:\,(u,v)\in E}\; \sum_{k} F_k(u)\, m_{kl}$$

and for a vertex z of silent states:

$$F_l(z) = \sum_{u:\,(u,z)\in E}\; \sum_{k} F_k(u)\, m_{kl}$$

[Figure: a layered acyclic digraph; silent states in one layer above the regular, symbol-emitting states]
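A sketch of this generalized forward pass in Python, processing vertices in topological order. The vertex-based bookkeeping, the virtual start vertex, and its uniform state distribution are assumptions of this sketch:

```python
def forward_dag(topo, preds, silent, symbol, states, m, e):
    """Forward values F[v][l] over an acyclic digraph of vertices.

    topo   : vertices in topological order; topo[0] is a virtual start vertex
    preds  : dict vertex -> list of predecessor vertices
    silent : set of vertices holding silent states (no emission factor)
    symbol : dict vertex -> symbol emitted at that vertex (regular vertices only)
    """
    F = {topo[0]: {l: 1.0 / len(states) for l in states}}  # assumed uniform start
    for v in topo[1:]:
        F[v] = {}
        for l in states:
            incoming = sum(F[u][k] * m[k][l] for u in preds[v] for k in states)
            # a regular vertex multiplies in the emission probability e_l(x_v);
            # a silent vertex just propagates the incoming sum
            F[v][l] = incoming if v in silent else e[l][symbol[v]] * incoming
    return F
```

Topological order guarantees that every predecessor value F[u] already exists when vertex v is processed, which is exactly why directed cycles of silent states must be avoided.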

Page 36:

3. High Order Markov Chains

Markov chains in which the transition probabilities depend on the last k states:

P(xi | xi-1,...,x1) = P(xi | xi-1,...,xi-k)

This can be represented by a standard Markov chain with more states. E.g., for k = 2:

[Figure: the four states AA, AB, BA, BB, with transitions between overlapping pairs]
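A sketch of this state expansion in Python. The `cond_prob` callback standing in for the k-th order transition probabilities is hypothetical:

```python
from itertools import product

def expand_high_order(symbols, k, cond_prob):
    """First-order chain equivalent to a k-th order chain over `symbols`.

    States are k-tuples of symbols; only transitions that shift the window,
    (a1,..,ak) -> (a2,..,ak,b), get the nonzero probability cond_prob(b, history).
    """
    states = list(product(symbols, repeat=k))
    m = {s: {s[1:] + (b,): cond_prob(b, s) for b in symbols} for s in states}
    return states, m

# e.g. for k = 2 over {A, B}: four states AA, AB, BA, BB, and
# m[('A','B')][('B','A')] = P(x_i = 'A' | x_{i-2} = 'A', x_{i-1} = 'B').
```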

Page 37:

4. Inhomogeneous Markov Chains

An important task in analyzing DNA sequences is recognizing the genes which code for proteins.

A triplet of 3 nucleotides, a codon, codes for an amino acid. It is known that in parts of DNA which code for genes, the three codon positions have different statistics. Thus a Markov chain model for DNA should represent not only the nucleotide (A, C, G or T), but also its position: the same nucleotide in different positions will have different transition probabilities. Used in the GENEMARK gene finding program (1993).
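As a minimal illustration (not GENEMARK's actual model, which is richer), an inhomogeneous chain can keep one transition matrix per codon position and cycle through them; the fixed reading frame and the omitted initial-symbol probability are simplifications of this sketch:

```python
def codon_chain_prob(x, matrices):
    """Transition probability of DNA string x under an inhomogeneous chain
    with one transition matrix per codon position (reading frame fixed at 0)."""
    p = 1.0
    for i in range(1, len(x)):
        m = matrices[i % 3]      # the codon position selects the matrix
        p *= m[x[i - 1]][x[i]]
    return p
```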