CZ5226: Advanced Bioinformatics
Lecture 6: HMM Method for generating motifs
Prof. Chen Yu Zong
Tel: 6874-6877
Email: csccyz@nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, National University of Singapore
Problem in biology
• Data and patterns are often not clear cut
• When we want to make a method to recognise a pattern (e.g. a sequence motif), we have to learn from the data (e.g. maybe there are other differences between sequences that have the pattern and those that do not)
• This leads to data mining and machine learning
A widely used machine learning approach: Markov models
Contents:
• Markov chain models (1st order, higher order and inhomogeneous models; parameter estimation; classification)
• Interpolated Markov models (and back-off models)
• Hidden Markov models (forward, backward and Baum-Welch algorithms; model topologies; applications to gene finding and protein family modeling)
Markov Chain Models
• a Markov chain model is defined by:
  – a set of states
    • some states emit symbols
    • other states (e.g. the begin state) are silent
  – a set of transitions with associated probabilities
    • the transitions emanating from a given state define a distribution over the possible next states
Markov Chain Models
• Given some sequence x of length L, we can ask how probable the sequence is given our model
• For any probabilistic model of sequences, we can write this probability as
  $\Pr(x) = \Pr(x_L, x_{L-1}, \ldots, x_1) = \Pr(x_L \mid x_{L-1}, \ldots, x_1)\,\Pr(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots \Pr(x_1)$
• Key property of a (1st order) Markov chain: the probability of each $x_i$ depends only on $x_{i-1}$
  $\Pr(x) = \Pr(x_L \mid x_{L-1})\,\Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1)\,\Pr(x_1)$
Markov Chain Models
Pr(cggt) = Pr(c)Pr(g|c)Pr(g|g)Pr(t|g)
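The factorisation above can be sketched in a few lines of Python. The initial and transition probabilities below are invented purely for illustration (they are not values from the lecture), and the function name `sequence_probability` is likewise hypothetical.

```python
# Hypothetical parameters for illustration only; not taken from the lecture.
initial = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
transition = {
    "a": {"a": 0.4, "c": 0.2, "g": 0.2, "t": 0.2},
    "c": {"a": 0.1, "c": 0.3, "g": 0.4, "t": 0.2},
    "g": {"a": 0.2, "c": 0.2, "g": 0.3, "t": 0.3},
    "t": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3},
}


def sequence_probability(seq, initial, transition):
    """Pr(x) = Pr(x_1) * product over i >= 2 of Pr(x_i | x_{i-1})."""
    prob = initial[seq[0]]
    # Walk over consecutive pairs (x_{i-1}, x_i) and multiply the transitions.
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob


# Pr(cggt) = Pr(c) Pr(g|c) Pr(g|g) Pr(t|g) under the toy parameters above
p_cggt = sequence_probability("cggt", initial, transition)
print(p_cggt)
```

With these toy numbers the product is 0.25 × 0.4 × 0.3 × 0.3. For long sequences one would normally sum log-probabilities instead, to avoid numerical underflow.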
Markov Chain Models
Can also have an end state, allowing the model to represent:
• Sequences of different lengths
• Preferences for sequences ending with particular symbols
Markov Chain Models
The transition parameters can be denoted by $a_{x_{i-1} x_i}$, where
  $a_{x_{i-1} x_i} = \Pr(x_i \mid x_{i-1})$
Similarly we can denote the probability of a sequence x as
  $\Pr(x) = a_{B x_1} \prod_{i=2}^{L} a_{x_{i-1} x_i}$
where $a_{B x_1}$ represents the transition from the begin state
Example Application
• CpG islands
  – CG dinucleotides are rarer in eukaryotic genomes than expected given the independent probabilities of C, G
  – but the regions upstream of genes are richer in CG dinucleotides than elsewhere (CpG islands)
  – useful evidence for finding genes
• Could predict CpG islands with Markov chains
  – one to represent CpG islands
  – one to represent the rest of the genome
The example includes using maximum likelihood and Bayesian statistics on the data and feeding the result to an HMM
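One common way to use the two chains is a log-odds score: sum, over consecutive pairs, the log of the island transition probability divided by the background one, and call the sequence island-like when the score is positive. The sketch below assumes that approach; the training sequences are invented toy data, and `train_chain` and `log_odds` are hypothetical helper names.

```python
import math

ALPHABET = "acgt"


def train_chain(seqs, pseudocount=1.0):
    """Estimate 1st order transition probabilities, with add-one smoothing."""
    counts = {p: {c: pseudocount for c in ALPHABET} for p in ALPHABET}
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1
    return {
        p: {c: counts[p][c] / sum(counts[p].values()) for c in ALPHABET}
        for p in ALPHABET
    }


def log_odds(seq, island, background):
    """Sum of log(a+_{st} / a-_{st}) over consecutive pairs (s, t) in seq."""
    return sum(
        math.log(island[p][c] / background[p][c]) for p, c in zip(seq, seq[1:])
    )


# Toy training data, invented for illustration only
island_seqs = ["cgcgcgcgtacgcg", "gcgcgccgcgcg"]
background_seqs = ["atatattagcattaat", "ttaacatgattaaatt"]

island = train_chain(island_seqs)
background = train_chain(background_seqs)

score_cg = log_odds("cgcgcg", island, background)  # CG-rich query
score_at = log_odds("atatat", island, background)  # AT-rich query
print(score_cg, score_at)
```

A positive score favours the island model and a negative score favours the background model. Real parameters would be estimated from annotated genomic sequence rather than from a handful of short strings.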
Estimating the Model Parameters
• Given some data (e.g. a set of sequences from CpG islands), how can we determine the probability parameters of our model?
• One approach: maximum likelihood estimation
  – given a set of data D
  – set the parameters θ to maximize Pr(D | θ)
  – i.e. make the data D look likely under the model
Maximum Likelihood Estimation
• Suppose we want to estimate the parameters Pr(a), Pr(c), Pr(g), Pr(t)
• And we’re given the sequences:
  accgcgctta
  gcttagtgac
  tagccgttac
• Then the maximum likelihood estimates are:
  Pr(a) = 6/30 = 0.2      Pr(g) = 7/30 = 0.233
  Pr(c) = 9/30 = 0.3      Pr(t) = 8/30 = 0.267
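The estimates above are just relative frequencies, which a short Python sketch can reproduce. It assumes the 30 training characters split into three sequences of 10; the counts do not depend on where the boundaries fall.

```python
from collections import Counter

# The 30 training characters from the slide, assumed to be three sequences of 10
sequences = ["accgcgctta", "gcttagtgac", "tagccgttac"]

# Maximum likelihood estimate for each base = its count / total number of bases
counts = Counter("".join(sequences))
total = sum(counts.values())
estimates = {base: counts[base] / total for base in "acgt"}

for base in "acgt":
    print(f"Pr({base}) = {counts[base]}/{total} = {estimates[base]:.3f}")
```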
These data are derived from genome sequences
Higher Order Markov Chains
• An nth order Markov chain over some alphabet is equivalent to a first order Markov chain over the alphabet of n-tuples
• Example: a 2nd order Markov model for DNA can be treated as a 1st order Markov model over the alphabet: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, and TT (i.e. all possible dinucleotides)
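As a small illustration of this equivalence, the sketch below enumerates the n-tuple alphabet and rewrites a DNA string as the sequence of overlapping n-tuples that the equivalent first order chain would visit. The function names are hypothetical.

```python
from itertools import product


def tuple_states(alphabet, n):
    """All n-tuples over the alphabet: the states of the equivalent 1st order chain."""
    return ["".join(t) for t in product(alphabet, repeat=n)]


def as_tuple_sequence(seq, n):
    """Overlapping n-tuples of seq, i.e. the state path in the n-tuple chain."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


states = tuple_states("ACGT", 2)      # the 16 dinucleotides AA ... TT
path = as_tuple_sequence("ACGGT", 2)  # AC -> CG -> GG -> GT
print(len(states), path)
```

Consecutive tuples must overlap in n-1 symbols, so from each tuple state only 4 of the 16 transitions can have nonzero probability.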
A Fifth Order Markov Chain
Inhomogeneous Markov Chains
• In the Markov chain models we have considered so far, the probabilities do not depend on where we are in a given sequence
• In an inhomogeneous Markov model, we can have different distributions at different positions in the sequence
• Consider modeling codons in protein coding regions
A Fifth Order Inhomogeneous Markov Chain
Selecting the Order of a Markov Chain Model
• Higher order models remember more “history”
• Additional history can have predictive value
• Example:
  – predict the next word in this sentence fragment “…finish __” (up, it, first, last, …?)
  – now predict it given more history
    • “Fast guys finish __”
Hidden Markov models (HMMs)
Given, say, a T in our input sequence, which state emitted it?
Hidden Markov models (HMMs)
Hidden State
• We will distinguish between the observed parts of a problem and the hidden parts
• In the Markov models we have considered previously, it is clear which state accounts for each part of the observed sequence
• In the model above (preceding slide), there are multiple states that could account for each part of the observed sequence
  – this is the hidden part of the problem
  – states are decoupled from sequence symbols
HMM-based homology searching
Transition probabilities and Emission probabilities
Gapped HMMs also have insertion and deletion states
Profile HMM: m = match state, I = insert state, d = delete state; go from left to right. I and m states output amino acids; d states are “silent”.
[Figure: profile HMM running from Start to End, with delete states d1 to d4, insert states I0 to I4 and match states m0 to m5]
HMM-based homology searching
• Most widely used HMM-based profile searching tools currently are SAM-T99 (Karplus et al., 1998) and HMMER2 (Eddy, 1998)
• formal probabilistic basis and consistent theory behind gap and insertion scores
• HMMs good for profile searches, bad for alignment (due to parametrisation of the models)
• HMMs are slow
Homology-derived Secondary Structure of Proteins
Sander & Schneider, 1991
The Parameters of an HMM
HMM for Eukaryotic Gene Finding
Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences
A Simple HMM
Three Important Questions
• How likely is a given sequence?
  the Forward algorithm
• What is the most probable “path” for generating a given sequence?
  the Viterbi algorithm
• How can we learn the HMM parameters given a set of sequences?
  the Forward-Backward (Baum-Welch) algorithm
How Likely is a Given Sequence?
• The probability that the path is taken and the sequence is generated:
• (assuming begin/end are the only silent states on path)
How Likely is a Given Sequence?
The probability over all paths is:
but the number of paths can be exponential in the length of the sequence...
• the Forward algorithm enables us to compute this efficiently
How Likely is a Given Sequence: The Forward Algorithm
• Define $f_k(i)$ to be the probability of being in state k having observed the first i characters of x
• We want to compute $f_N(L)$, the probability of being in the end state having observed all of x
• We can define this recursively
The forward algorithm
• Initialisation:
  $f_0(0) = 1$ (start): the probability that we’re in the start state and have observed 0 characters from the sequence
  $f_k(0) = 0$ (other silent states k)
• Recursion:
  $f_l(i) = e_l(i) \sum_k f_k(i-1)\, a_{kl}$ (emitting states)
  $f_l(i) = \sum_k f_k(i)\, a_{kl}$ (silent states)
• Termination:
  $\Pr(x) = \Pr(x_1 \ldots x_L) = f_N(L) = \sum_k f_k(L)\, a_{kN}$: the probability that we are in the end state and have observed the entire sequence
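A minimal Python sketch of the forward recursion follows. It assumes the only silent states are begin and end, so their transitions are passed as separate `start` and `end` tables. The two-state model ("H", "L") and all its numbers are invented for illustration. A brute-force sum over every path is included only to check the result on a short sequence.

```python
from itertools import product

# Hypothetical two-state HMM; every number below is invented for illustration.
STATES = ["H", "L"]
START = {"H": 0.5, "L": 0.5}                      # a_{0k}: begin -> k
TRANS = {"H": {"H": 0.5, "L": 0.4},               # a_{kl}: k -> l
         "L": {"H": 0.3, "L": 0.6}}
END = {"H": 0.1, "L": 0.1}                        # a_{kN}: k -> end
EMIT = {"H": {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2},
        "L": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}


def forward(x, states, start, trans, emit, end):
    """Pr(x) summed over all paths, for a non-empty sequence x."""
    # f[k] holds f_k(i) for the current position i (starting at i = 1)
    f = {k: start[k] * emit[k][x[0]] for k in states}
    for symbol in x[1:]:
        f = {l: emit[l][symbol] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    # Termination: move from every state into the silent end state
    return sum(f[k] * end[k] for k in states)


def brute_force(x, states, start, trans, emit, end):
    """Same quantity by enumerating all len(states)**len(x) paths."""
    total = 0.0
    for path in product(states, repeat=len(x)):
        p = start[path[0]] * emit[path[0]][x[0]]
        for i in range(1, len(x)):
            p *= trans[path[i - 1]][path[i]] * emit[path[i]][x[i]]
        total += p * end[path[-1]]
    return total


x = "gcat"
p_forward = forward(x, STATES, START, TRANS, EMIT, END)
p_brute = brute_force(x, STATES, START, TRANS, EMIT, END)
print(p_forward, p_brute)
```

The forward version costs time proportional to L × (number of states)², whereas the enumeration grows exponentially with L. Practical implementations usually work with scaled values or logs to avoid underflow.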
Forward algorithm example
Three Important Questions
• How likely is a given sequence?
• What is the most probable “path” for generating a given sequence?
• How can we learn the HMM parameters given a set of sequences?
Finding the Most Probable Path: The Viterbi Algorithm
• Define $v_k(i)$ to be the probability of the most probable path accounting for the first i characters of x and ending in state k
• We want to compute $v_N(L)$, the probability of the most probable path accounting for all of the sequence and ending in the end state
• Can be defined recursively
• Can use DP to find $v_N(L)$ efficiently
Finding the Most Probable Path: The Viterbi Algorithm
Initialisation:
  $v_0(0) = 1$ (start), $v_k(0) = 0$ (non-silent states)
Recursion for emitting states (i = 1…L):
  $v_l(i) = e_l(i) \max_k \left[ v_k(i-1)\, a_{kl} \right]$
Recursion for silent states:
  $v_l(i) = \max_k \left[ v_k(i)\, a_{kl} \right]$
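The Viterbi recursion has the same shape as the forward recursion, with the sum over predecessor states replaced by a maximum and a back-pointer stored for traceback. The Python sketch below assumes begin and end are the only silent states; the two-state model and its numbers are invented for illustration, and `path_probability` is a hypothetical helper used only to check the result.

```python
# Hypothetical two-state HMM; every number below is invented for illustration.
STATES = ["H", "L"]
START = {"H": 0.5, "L": 0.5}
TRANS = {"H": {"H": 0.5, "L": 0.4}, "L": {"H": 0.3, "L": 0.6}}
END = {"H": 0.1, "L": 0.1}
EMIT = {"H": {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2},
        "L": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}


def viterbi(x, states, start, trans, emit, end):
    """Return (probability, path) of the most probable state path for x."""
    v = {k: start[k] * emit[k][x[0]] for k in states}
    backpointers = []
    for symbol in x[1:]:
        ptr, nxt = {}, {}
        for l in states:
            # Best predecessor k for state l at this position
            best = max(states, key=lambda k: v[k] * trans[k][l])
            ptr[l] = best
            nxt[l] = emit[l][symbol] * v[best] * trans[best][l]
        backpointers.append(ptr)
        v = nxt
    # Termination: best last state, including the transition to the end state
    last = max(states, key=lambda k: v[k] * end[k])
    best_prob = v[last] * end[last]
    # Traceback from the last state to the first
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return best_prob, path


def path_probability(x, path, start, trans, emit, end):
    """Joint probability of sequence x and one given state path."""
    p = start[path[0]] * emit[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][x[i]]
    return p * end[path[-1]]


x = "gcat"
best_prob, best_path = viterbi(x, STATES, START, TRANS, EMIT, END)
print(best_prob, "".join(best_path))
```

In practice the products are usually replaced by sums of log-probabilities, which leaves the chosen path unchanged and avoids underflow.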
Three Important Questions
• How likely is a given sequence? (clustering)
• What is the most probable “path” for generating a given sequence? (alignment)
• How can we learn the HMM parameters given a set of sequences?
The Learning Task
• Given:
  – a model
  – a set of sequences (the training set)
• Do:
  – find the most likely parameters to explain the training sequences
• The goal is to find a model that generalizes well to sequences we haven’t seen before
Learning Parameters
• If we know the state path for each training sequence, learning the model parameters is simple
  – no hidden state during training
  – count how often each parameter is used
  – normalize/smooth to get probabilities
  – process just like it was for Markov chain models
• If we don’t know the path for each training sequence, how can we determine the counts?
  – key insight: estimate the counts by considering every path weighted by its probability
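For the first case, where the state path is known, the count-and-normalize step can be sketched as follows. The labeled training pairs are invented toy data (state "I" for island, "N" for non-island), add-one smoothing is an assumption, and begin/end transitions are ignored to keep the example short.

```python
from collections import Counter


def estimate_from_labeled(pairs, states, alphabet, pseudocount=1.0):
    """Count transitions and emissions along known paths, then normalize.

    pairs: iterable of (sequence, state_path) strings of equal length.
    """
    trans_counts, emit_counts = Counter(), Counter()
    for seq, path in pairs:
        for state, symbol in zip(path, seq):
            emit_counts[(state, symbol)] += 1
        for k, l in zip(path, path[1:]):
            trans_counts[(k, l)] += 1

    trans, emit = {}, {}
    for k in states:
        t_total = sum(trans_counts[(k, l)] + pseudocount for l in states)
        trans[k] = {l: (trans_counts[(k, l)] + pseudocount) / t_total
                    for l in states}
        e_total = sum(emit_counts[(k, c)] + pseudocount for c in alphabet)
        emit[k] = {c: (emit_counts[(k, c)] + pseudocount) / e_total
                   for c in alphabet}
    return trans, emit


# Toy labeled data, invented for illustration only
training = [("acgcgt", "NNIIIN"), ("cgcgaa", "IIIINN")]
trans, emit = estimate_from_labeled(training, "IN", "acgt")
print(trans["I"], emit["I"])
```

When the paths are unknown, the integer counts in this sketch are replaced by expected counts computed over all paths, which is what the Baum-Welch algorithm does.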
Learning Parameters: The Baum-Welch Algorithm
• An EM (expectation maximization) approach, a forward-backward algorithm
• Algorithm sketch:
  – initialize parameters of model
  – iterate until convergence
    • Calculate the expected number of times each transition or emission is used
    • Adjust the parameters to maximize the likelihood of these expected values
The Expectation step
• First, we need to know the probability of the i-th symbol being produced by state k, given sequence x:
  $\Pr(\pi_i = k \mid x)$
• Given this we can compute our expected counts for state transitions, character emissions
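This posterior is usually computed by combining forward and backward values: Pr(π_i = k | x) = f_k(i) b_k(i) / Pr(x), where b_k(i) is the probability of emitting the rest of the sequence and reaching the end state, given state k at position i. The definition of b_k(i) is the standard one and is not spelled out in the text above. The Python sketch below assumes begin and end are the only silent states; the model and its numbers are invented for illustration.

```python
# Hypothetical two-state HMM; every number below is invented for illustration.
STATES = ["H", "L"]
START = {"H": 0.5, "L": 0.5}
TRANS = {"H": {"H": 0.5, "L": 0.4}, "L": {"H": 0.3, "L": 0.6}}
END = {"H": 0.1, "L": 0.1}
EMIT = {"H": {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2},
        "L": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}


def forward_table(x, states, start, trans, emit):
    """f[i][k]: probability of x_1..x_{i+1} with state k at 0-based position i."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for symbol in x[1:]:
        prev = f[-1]
        f.append({l: emit[l][symbol] * sum(prev[k] * trans[k][l] for k in states)
                  for l in states})
    return f


def backward_table(x, states, trans, emit, end):
    """b[i][k]: probability of the symbols after position i and of ending, given k at i."""
    b = [{k: end[k] for k in states}]
    for symbol in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(trans[k][l] * emit[l][symbol] * nxt[l] for l in states)
                     for k in states})
    return b


def posteriors(x, states, start, trans, emit, end):
    """post[i][k] = Pr(pi_i = k | x) for every position i and state k."""
    f = forward_table(x, states, start, trans, emit)
    b = backward_table(x, states, trans, emit, end)
    px = sum(f[-1][k] * end[k] for k in states)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]


x = "gcgcatat"
post = posteriors(x, STATES, START, TRANS, EMIT, END)
for symbol, column in zip(x, post):
    print(symbol, {k: round(p, 3) for k, p in column.items()})
```

At each position the posteriors over states sum to 1, which is a quick sanity check on an implementation.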
The Backward Algorithm
The Maximization step
The Baum-Welch Algorithm
• Initialize parameters of model
• Iterate until convergence
  – calculate the expected number of times each transition or emission is used
  – adjust the parameters to maximize the likelihood of these expected values
• This algorithm will converge to a local maximum (in the likelihood of the data given the model)
• Usually in a fairly small number of iterations
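The loop above can be sketched compactly in Python. To keep it short, this version makes several simplifying assumptions that are not from the lecture: there is no explicit end state, no pseudocounts, and no scaling (so it is only suitable for short sequences). The two-state model, its starting values and the training sequences are all invented.

```python
import math


def forward_backward(x, states, start, trans, emit):
    """Forward table f, backward table b and Pr(x), with no explicit end state."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for c in x[1:]:
        f.append({l: emit[l][c] * sum(f[-1][k] * trans[k][l] for k in states)
                  for l in states})
    b = [{k: 1.0 for k in states}]
    for c in reversed(x[1:]):
        b.insert(0, {k: sum(trans[k][l] * emit[l][c] * b[0][l] for l in states)
                     for k in states})
    return f, b, sum(f[-1].values())


def baum_welch_step(seqs, states, alphabet, start, trans, emit):
    """One EM iteration; returns new parameters and the log-likelihood before the update."""
    A = {k: {l: 0.0 for l in states} for k in states}    # expected transition counts
    E = {k: {c: 0.0 for c in alphabet} for k in states}  # expected emission counts
    S = {k: 0.0 for k in states}                         # expected start counts
    loglik = 0.0
    for x in seqs:
        f, b, px = forward_backward(x, states, start, trans, emit)
        loglik += math.log(px)
        for k in states:
            S[k] += f[0][k] * b[0][k] / px
            for i, c in enumerate(x):
                E[k][c] += f[i][k] * b[i][k] / px
            for l in states:
                for i in range(len(x) - 1):
                    A[k][l] += (f[i][k] * trans[k][l]
                                * emit[l][x[i + 1]] * b[i + 1][l] / px)
    # Maximization: normalize the expected counts into probabilities
    new_start = {k: S[k] / sum(S.values()) for k in states}
    new_trans = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_emit = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet} for k in states}
    return new_start, new_trans, new_emit, loglik


# Toy data and starting parameters, invented for illustration only
STATES, ALPHABET = ["H", "L"], "acgt"
seqs = ["cgcgcgatatat", "atatatcgcgcg", "ttaacgcgcgta"]
start = {"H": 0.6, "L": 0.4}
trans = {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
emit = {"H": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},
        "L": {"a": 0.4, "c": 0.1, "g": 0.1, "t": 0.4}}

history = []
for _ in range(10):
    start, trans, emit, ll = baum_welch_step(seqs, STATES, ALPHABET, start, trans, emit)
    history.append(ll)
print([round(v, 3) for v in history])
```

The recorded log-likelihoods should never decrease from one iteration to the next, which reflects the convergence to a local maximum mentioned above. Different starting parameters can lead to different local maxima.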