Transcript of Slide3 HMM
-
8/8/2019 Slide3 HMM
1/51
Hidden Markov Models
Dr. Nguyen Van Vinh
Department of Computer Science, University of Engineering and Technology, Vietnam National University, Hanoi
-
Why Learn? Machine learning is programming computers to optimize a performance criterion using example data or past experience.
There is no need to learn to calculate payroll.
Learning is used when:
Human expertise does not exist (navigating on Mars)
Humans are unable to explain their expertise (speech recognition)
The solution changes over time (routing on a computer network)
The solution needs to be adapted to particular cases (user biometrics)
-
What We Talk About When We Talk About Learning
Learning general models from data of particular examples
Data is cheap and abundant (data warehouses, data marts); knowledge is
expensive and scarce.
Example in retail: Customer transactions to consumer behavior:
People who bought Da Vinci Code also bought The Five People You Meet
in Heaven (www.amazon.com)
Build a model that is a good and useful approximation to the data.
-
What is Machine Learning? Optimize a performance criterion using example data or past
experience.
Role of Statistics: Inference from a sample
Role of Computer Science: Efficient algorithms to
solve the optimization problem
and to represent and evaluate the model for inference
-
-
Part-of-Speech Tagging Input:
Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO
Alan Mulally announced first quarter results.
Output:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P
Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V
first/ADJ quarter/N results/N ./.
N = Noun
V = Verb
P = Preposition
ADV = Adverb
ADJ = Adjective
-
Face Recognition Training examples of a person
Test images
AT&T Laboratories, Cambridge UK: http://www.uk.research.att.com/facedatabase.html
-
Introduction Modeling dependencies in input
Sequences:
Temporal: in speech, phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language); in handwriting, pen movements
Spatial: in a DNA sequence, base pairs
-
Discrete Markov Process
N states: S1, S2, ..., SN; state at time t: qt = Si
First-order Markov property:
P(q_{t+1} = Sj | qt = Si, q_{t-1} = Sk, ...) = P(q_{t+1} = Sj | qt = Si)
Transition probabilities:
a_ij ≡ P(q_{t+1} = Sj | qt = Si), with a_ij ≥ 0 and Σ_{j=1..N} a_ij = 1
Initial probabilities:
π_i ≡ P(q1 = Si), with Σ_{i=1..N} π_i = 1
-
Stochastic Automaton
P(O = Q | A, π) = P(q1) Π_{t=2..T} P(qt | q_{t-1}) = π_{q1} a_{q1 q2} ⋯ a_{q_{T-1} qT}
-
Example: Balls and Urns Three urns each full of balls of one color
S1: red, S2: blue, S3: green
π = [0.5, 0.2, 0.3]^T
A =
| 0.4  0.3  0.3 |
| 0.2  0.6  0.2 |
| 0.1  0.1  0.8 |
O = {S1, S1, S3, S3}
P(O | A, π) = P(S1) · P(S1 | S1) · P(S3 | S1) · P(S3 | S3)
= π_1 · a_11 · a_13 · a_33
= 0.5 · 0.4 · 0.3 · 0.8 = 0.048
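The arithmetic above can be checked with a short Python sketch (indices are 0-based here, so state 0 stands for S1 and state 2 for S3):

```python
# Observable Markov model from the urn example: S1 = red, S2 = blue, S3 = green.
pi = [0.5, 0.2, 0.3]                  # initial probabilities pi_i
A = [[0.4, 0.3, 0.3],                 # transition probabilities a_ij
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sequence_probability(states, pi, A):
    """P(O | A, pi) for a fully observed state sequence."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

# O = {S1, S1, S3, S3}: pi_1 * a_11 * a_13 * a_33 = 0.5 * 0.4 * 0.3 * 0.8 = 0.048
p = sequence_probability([0, 0, 2, 2], pi, A)
```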
-
Balls and Urns: Learning Given K example sequences of length T
π̂_i = (number of sequences starting with Si) / K
â_ij = (number of transitions from Si to Sj) / (number of transitions from Si)
= Σ_k Σ_{t=1..T-1} 1(q_t^k = Si and q_{t+1}^k = Sj) / Σ_k Σ_{t=1..T-1} 1(q_t^k = Si)
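These counting estimators can be sketched directly in Python; the two short state sequences below are invented purely for illustration, using 0-based state indices:

```python
def estimate_parameters(sequences, N):
    """Count-based estimates of pi_i and a_ij from K observed state sequences."""
    K = len(sequences)
    # pi_i = (number of sequences starting with S_i) / K
    pi_hat = [sum(seq[0] == i for seq in sequences) / K for i in range(N)]
    # a_ij = (number of transitions S_i -> S_j) / (number of transitions from S_i)
    from_count = [0] * N
    pair_count = [[0] * N for _ in range(N)]
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            from_count[prev] += 1
            pair_count[prev][cur] += 1
    A_hat = [[pair_count[i][j] / from_count[i] if from_count[i] else 0.0
              for j in range(N)] for i in range(N)]
    return pi_hat, A_hat

# K = 2 invented sequences of length T = 4 over states {0, 1}:
pi_hat, A_hat = estimate_parameters([[0, 0, 1, 1], [0, 1, 1, 1]], N=2)
```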
-
Hidden Markov Models States are not observable
Discrete observations {v1, v2, ..., vM} are recorded; a probabilistic function of the state
Emission probabilities:
b_j(m) ≡ P(Ot = vm | qt = Sj)
Example: In each urn, there are balls of different colors, but with different probabilities.
For each observation sequence, there are multiple state sequences
-
Hidden Markov Model (HMM) HMMs allow you to estimate probabilities of unobserved events
Given the observed surface data, which underlying parameters generated it?
E.g., in speech recognition, the observed data is the acoustic signal and the words are the hidden parameters
-
HMMs and their Usage HMMs are very common in Computational Linguistics:
Speech recognition (observed: acoustic signal, hidden: words)
Handwriting recognition (observed: image, hidden: words)
Part-of-speech tagging (observed: words, hidden: part-of-speech tags)
Machine translation (observed: foreign words, hidden: words in target language)
-
Noisy Channel Model In speech recognition you observe an acoustic signal (A = a1,...,an) and you want to determine the most likely sequence of words (W = w1,...,wn): P(W | A)
Problem: A and W are too specific for reliable counts on observed data, and are very unlikely to occur in unseen data
-
Noisy Channel Model Assume that the acoustic signal (A) is already segmented with respect to word boundaries
P(W | A) could then be computed as
P(W | A) ≈ Π_i max_{w_i} P(w_i | a_i)
Problem: Finding the most likely word corresponding to an acoustic representation depends on the context
E.g., /'pre-z&ns/ could mean presents or presence depending on the context
-
Noisy Channel Model Given a candidate sequence W we need to compute P(W) and combine it with P(W | A)
Applying Bayes' rule:
argmax_W P(W | A) = argmax_W P(A | W) P(W) / P(A)
The denominator P(A) can be dropped, because it is constant for all W
-
Noisy Channel in a Picture
-
Decoding The decoder combines evidence from two sources:
The likelihood P(A | W), approximated as
P(A | W) ≈ Π_{i=1..n} P(a_i | w_i)
The prior P(W), approximated as
P(W) ≈ P(w_1) Π_{i=2..n} P(w_i | w_{i-1})
-
Search Space Given a word-segmented acoustic sequence, list all candidates
Compute the most likely path

'bot        ik-'spen-siv    'pre-z&ns
boat        excessive       presidents
bald        expensive       presence
bold        expressive      presents
bought      inactive        press

e.g., P(inactive | bald), P('bot | bald)
-
Markov Assumption The Markov assumption states that the probability of the occurrence of word w_i at time t depends only on the occurrence of word w_{i-1} at time t-1
Chain rule:
P(w_1, ..., w_n) = P(w_1) Π_{i=2..n} P(w_i | w_1, ..., w_{i-1})
Markov assumption:
P(w_1, ..., w_n) ≈ P(w_1) Π_{i=2..n} P(w_i | w_{i-1})
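As a sketch, the bigram approximation turns the sentence probability into a running product of conditional probabilities. The numbers below are invented placeholders, not estimates from any corpus, and a start symbol <s> stands in for P(w1):

```python
# Hypothetical bigram probabilities P(w_i | w_{i-1}); not from any real corpus.
bigram = {("<s>", "profits"): 0.01,
          ("profits", "soared"): 0.05,
          ("soared", "at"): 0.2}

def sentence_probability(words, bigram):
    """P(w1,...,wn) under the first-order Markov (bigram) approximation.
    The start symbol <s> plays the role of P(w1)."""
    p = 1.0
    for prev, cur in zip(["<s>"] + words, words):
        p *= bigram.get((prev, cur), 0.0)  # unseen bigrams get probability 0
    return p

p = sentence_probability(["profits", "soared", "at"], bigram)  # 0.01 * 0.05 * 0.2
```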
-
The Trellis
-
Parameters of an HMM
States: a set of states S = s_1, ..., s_n
Transition probabilities: A = a_{1,1}, a_{1,2}, ..., a_{n,n}; each a_{i,j} represents the probability of transitioning from state s_i to s_j
Emission probabilities: a set B of functions of the form b_i(o_t), the probability of observation o_t being emitted by s_i
Initial state distribution: π_i is the probability that s_i is a start state
-
The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation sequence O = o_1, ..., o_T and an HMM model λ = (A, B, π), how do we compute the probability of O given the model?
Problem 2 (Decoding): Given the observation sequence O = o_1, ..., o_T and an HMM model λ = (A, B, π), how do we find the state sequence that best explains the observations?
-
The Three Basic HMM Problems
Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
-
Problem 1: Probability of an Observation Sequence
What is P(O | λ)?
The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
Naive computation is very expensive. Given T observations and N states, there are N^T possible state sequences.
Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
The solution to this and to Problem 2 is to use dynamic programming.
-
Forward Probabilities What is the probability that, given an HMM λ, at time t the state is s_i and the partial observation o_1 ... o_t has been generated?
α_t(i) = P(o_1 ... o_t, q_t = s_i | λ)
-
Forward Probabilities
α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) a_ij ] · b_j(o_t)
where α_t(i) = P(o_1 ... o_t, q_t = s_i | λ)
-
Forward Algorithm
Initialization: α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N
Induction: α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) a_ij ] · b_j(o_t), 2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination: P(O | λ) = Σ_{i=1..N} α_T(i)
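The three steps translate almost line-for-line into Python. This is a minimal sketch on a small made-up two-state HMM; a production implementation would work in log space or rescale α to avoid underflow:

```python
def forward(O, pi, A, B):
    """P(O | lambda) via the forward algorithm; O is a list of observation indices."""
    N, T = len(pi), len(O)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
    for t in range(1, T):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)

# Made-up two-state HMM with two observation symbols:
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = forward([0, 1, 0], pi, A, B)
```

Here N = 2 and T = 3, so the recursion touches N^2 · T = 12 products instead of enumerating all N^T = 8 state sequences; the gap grows rapidly with T.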
-
Forward Algorithm Complexity In the naive approach to solving Problem 1 it takes on the order of 2T · N^T computations
The forward algorithm takes on the order of N^2 · T computations
-
Backward Probabilities Analogous to the forward probability, just in the other direction
What is the probability that, given an HMM λ and given that the state at time t is s_i, the partial observation o_{t+1} ... o_T is generated?
β_t(i) = P(o_{t+1} ... o_T | q_t = s_i, λ)
-
Backward Probabilities
β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j)
where β_t(i) = P(o_{t+1} ... o_T | q_t = s_i, λ)
-
Backward Algorithm
Initialization: β_T(i) = 1, 1 ≤ i ≤ N
Induction: β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T-1, ..., 1, 1 ≤ i ≤ N
Termination: P(O | λ) = Σ_{i=1..N} π_i b_i(o_1) β_1(i)
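A matching sketch of the backward pass, again on a made-up two-state HMM. Note that the termination step weights β_1(i) by π_i b_i(o_1), so the result must agree with the forward algorithm's P(O | λ):

```python
def backward(O, pi, A, B):
    """P(O | lambda) via the backward algorithm; O is a list of observation indices."""
    N, T = len(pi), len(O)
    # Initialization: beta_T(i) = 1
    beta = [1.0] * N
    # Induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        beta = [sum(A[i][j] * B[j][O[t + 1]] * beta[j] for j in range(N))
                for i in range(N)]
    # Termination: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[i][O[0]] * beta[i] for i in range(N))

# Same made-up two-state HMM as in the forward example:
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = backward([0, 1, 0], pi, A, B)
```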
-
Problem 2: Decoding
The solution to Problem 1 (Evaluation) gives us the sum over all paths through an HMM efficiently.
For Problem 2, we want to find the path with the highest probability.
We want to find the state sequence Q = q_1 ... q_T, such that
Q = argmax_{Q'} P(Q' | O, λ)
-
Viterbi Algorithm Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum
Forward: α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) a_ij ] · b_j(o_t)
Viterbi recursion: δ_t(j) = [ max_{1≤i≤N} δ_{t-1}(i) a_ij ] · b_j(o_t)
-
Viterbi Algorithm
Initialization: δ_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N
Induction: δ_t(j) = [ max_{1≤i≤N} δ_{t-1}(i) a_ij ] · b_j(o_t)
ψ_t(j) = argmax_{1≤i≤N} δ_{t-1}(i) a_ij, 2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination: p* = max_{1≤i≤N} δ_T(i)
q_T* = argmax_{1≤i≤N} δ_T(i)
Read out path: q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, ..., 1
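The full procedure can be sketched as follows (made-up two-state model; in practice one works with log probabilities to avoid underflow):

```python
def viterbi(O, pi, A, B):
    """Most likely state sequence for O and its probability p*."""
    N, T = len(pi), len(O)
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]
    psi = []  # psi_t(j): best predecessor of state j at time t
    # Induction: maximize instead of summing over incoming transitions
    for t in range(1, T):
        psi.append([max(range(N), key=lambda i: delta[i] * A[i][j])
                    for j in range(N)])
        delta = [max(delta[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    # Termination: p* = max_i delta_T(i), then read the path out backwards
    p_star = max(delta)
    path = [max(range(N), key=lambda i: delta[i])]
    for back_pointers in reversed(psi):
        path.append(back_pointers[path[-1]])
    path.reverse()
    return path, p_star

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
path, p_star = viterbi([0, 1, 0], pi, A, B)
```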
-
Problem 3: Learning
Up to now we've assumed that we know the underlying model λ = (A, B, π)
Often these parameters are estimated on annotated training data, which has two drawbacks:
Annotation is difficult and/or expensive
Training data is different from the current data
We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ', such that
λ' = argmax_λ P(O | λ)
-
Problem 3: Learning
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ', such that
λ' = argmax_λ P(O | λ)
But it is possible to find a local maximum
Given an initial model λ, we can always find a model λ', such that
P(O | λ') ≥ P(O | λ)
-
Parameter Re-estimation Use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm
Using an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters
-
Parameter Re-estimation Three parameters need to be re-estimated:
Initial state distribution: π_i
Transition probabilities: a_{i,j}
Emission probabilities: b_i(o_t)
-
Re-estimating Transition Probabilities What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?
ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
-
Re-estimating Transition Probabilities
ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1..N} Σ_{j=1..N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)
where ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
-
Re-estimating Transition Probabilities The intuition behind the re-estimation equation for transition probabilities is
â_ij = (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i)
Formally:
â_ij = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} Σ_{j'=1..N} ξ_t(i, j')
-
Re-estimating Transition Probabilities Defining
γ_t(i) = Σ_{j=1..N} ξ_t(i, j)
as the probability of being in state s_i, given the complete observation O, we can say:
â_ij = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i)
-
Review of Probabilities
Forward probability α_t(i): the probability of being in state s_i, given the partial observation o_1,...,o_t
Backward probability β_t(i): the probability of being in state s_i, given the partial observation o_{t+1},...,o_T
Transition probability ξ_t(i, j): the probability of going from state s_i to state s_j, given the complete observation o_1,...,o_T
State probability γ_t(i): the probability of being in state s_i, given the complete observation o_1,...,o_T
-
Re-estimating Initial State Probabilities Initial state distribution: π_i is the probability that s_i is a start state
Re-estimation is easy:
π̂_i = expected number of times in state s_i at time 1
Formally: π̂_i = γ_1(i)
-
Re-estimation of Emission Probabilities Emission probabilities are re-estimated as
b̂_i(k) = (expected number of times in state s_i observing symbol v_k) / (expected number of times in state s_i)
Formally:
b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) γ_t(i) / Σ_{t=1..T} γ_t(i)
where δ(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise
Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm!
-
The Updated Model Coming from λ = (A, B, π) we get to λ' = (Â, B̂, π̂) by the following update rules:
â_ij = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i)
b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) γ_t(i) / Σ_{t=1..T} γ_t(i)
π̂_i = γ_1(i)
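One full re-estimation step can be sketched by combining the quantities defined above. This is a minimal single-sequence illustration on a made-up two-state model; real implementations add scaling, multiple observation sequences, and convergence checks. The property to observe is that the likelihood never decreases:

```python
def likelihood(O, pi, A, B):
    """P(O | lambda) via the forward algorithm."""
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    for t in range(1, len(O)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    return sum(alpha)

def baum_welch_step(O, pi, A, B):
    """One forward-backward re-estimation step; returns (pi_new, A_new, B_new)."""
    N, T, M = len(pi), len(O), len(B[0])
    # All forward variables alpha_t(i) and backward variables beta_t(i).
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    beta = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        beta.insert(0, [sum(A[i][j] * B[j][O[t + 1]] * beta[0][j]
                            for j in range(N)) for i in range(N)])
    p_obs = sum(alpha[-1])
    # xi_t(i,j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = [[[alpha[t][i] * A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # gamma_t(i) = alpha_t(i) beta_t(i) / P(O | lambda)
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    # Update rules from the slide.
    pi_new = [gamma[0][i] for i in range(N)]
    A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    B_new = [[sum(gamma[t][i] for t in range(T) if O[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return pi_new, A_new, B_new

O = [0, 1, 0, 0, 1]
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi_new, A_new, B_new = baum_welch_step(O, pi, A, B)
```

After the step, each row of Â and B̂ still sums to 1, and P(O | λ') ≥ P(O | λ) as the hill-climbing guarantee promises.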
-
Expectation Maximization The forward-backward algorithm is an instance of the more general EM algorithm
The E step: compute the forward and backward probabilities for a given model
The M step: re-estimate the model parameters
-
Exercise Programming with the Viterbi Algorithm
Apply HMMs to Part-of-Speech Tagging