Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM...
Transcript of Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM...
![Page 1: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/1.jpg)
1
Prof. Yechiam Yemini (YY)
Computer Science DepartmentColumbia University
Chapter 4: Hidden Markov Models
4.4 HMM Applications
2
Overview
Alignment with HMM Gene predictions Protein structure predictions
![Page 2: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/2.jpg)
2
3
Sequence Analysis Using HMMStep 1: construct an HMM model
Design an HMM generator for the observed sequencesAssign hidden states to underlying sequence regionsModel the questions to be answered in terms of hidden-pathway
Step 2: train the HMMSupervised/unsupervised
Step 3: analyze sequencesViterbi decoding: compute most likely hidden-pathway Forward/Backward: compute likelihood of sequence (& p/Z-score)
4
Alignment
![Page 3: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/3.jpg)
3
5
Profile HMM
S EMM M
I
D
ACTG
.6
.2
.1
.1
D
II I
D
6
Searching With Profile HMM Is the target sequence S generated by profile H?
Viterbi: use H to compute the most probable alignment of S Forward: compute probability of S generated by H
Example: searching for globinsCreate profile HMM from known globin sequencesUse the forward algorithm to search SWISS-PROT for globinCompare against the null hypothesis (random sequence)Result: globins are highly distinct
![Page 4: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/4.jpg)
4
7
Pairwise Alignment With HMM Key idea: recast DP as Viterbi decoding“Alignment” = sequence of pairs {<X,Y>,<X,_>, <_,Y>} Consider an HMM that emits pairs (pair HMM):
Begin M: PXY End
X: qX
Y: qy
ε
ε
τ
τ
τδ
δδ
δ
τ
1−2δ−τ
1−2δ−τ
1−ε−τ
1−ε−τ
8
Viterbi Decoding Pairwise Alignment
Begin M: PXY End
X: qX
Y: qy
ε
ε
τ
τ
τδ
δδ
δ
τ
1−2δ−τ
1−2δ−τ
1−ε−τ
1−ε−τ
1 2 3 4 5 6BMXYE
X=TGCGCAY=TGGCTA
TT -GGG
G-
XY-YX-
Viterbi decoding: optimum path of pairs Forward step: select among {XY,-Y,X-}
![Page 5: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/5.jpg)
5
9
Comparing With Standard DP Compute relationships scores probabilities Consider an HMM that emits pairs (pair HMM)
M
X
Ys(Xi,Yj) -d
s(Xi,Yj+k)
s(Xi+k,Yj)-d
-e
-e
Begin M: PXY End
X: qX
Y: qy
ε
ε
τ
τ
τδ
δδ
δ
τ
1−2δ−τ
1−2δ−τ
1−ε−τ
1−ε−τ
<δ,ε,τ,P,q> <s(x,y),d,e>
e.g., e=-log[ε/(1−τ) ]
10
Gene Predictions
![Page 6: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/6.jpg)
6
11
Schematics Prokaryotic genes
Eukaryotic genes
gene genegenepromoter
start stop
terminator
exon exonexonpromoter
start stopdonor acceptor
intron intron
12
Modeling Gene Structure With HMMGene regions are modeled as HMM statesEmission probabilities reflect nucleotide statistics
Splice site
![Page 7: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/7.jpg)
7
13
Simple Gene Models Consider a region of random length L
Markov model L is geometrically distributed P[L=k]=pk(1-p) E[L]=p/(1-p)
How do we model ORF?Codons can be modeled as higher-order states
A=.29C=.31G=.04T=.36
14
VEIL: Viterbi Exon-Intron Locator
Contains 9 hidden states Each state is a Markovian model of regions
Exons, introns, intergenic regions, splice sites, etc.Exon HMM Model
Upstream
Start Codon
Exon
Stop Codon
Downstream
3’ Splice Site
Intron
5’ Poly-A Site
5’ Splice Site
• Enter: start codon or intron (3’ Splice Site)
• Exit: 5’ Splice site or three stop codons(taa, tag, tga)
VEIL Architecture
(Henderson, Salzberg, & Fasman 1997)
![Page 8: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/8.jpg)
8
15
Genie (Kulp 96)
Uses a generalized HMM (GHMM) Edges in model are complete HMMs States are neural networks for signal finding
• J5’ – 5’ UTR
• EI – Initial Exon
• E – Exon, Internal Exon
• I – Intron
• EF – Final Exon
• ES – Single Exon
• J3’ – 3’UTR
BeginSequence
StartTranslation
Donorsplicesite
Acceptorsplicesite
StopTranslation
EndSequence
16
GeneScan (Burge & Karlin 97)
Ex5Intergenic Promoter Ex5Ex2 In4Ex2 Ex3 Ex4 Poly AIn3In2In1Ex1
5’ UTR 3’ UTR
Singleexongene
5’ UTR
Promoter
Intergenicregion
E1E0 E2
Einit Eterm
3’ UTR
I0 I1 I2
Poly A signal
J. Mol. Bio. (1997) 268, 78-94
Base models overall parse of the gene
5’UTR/3’UTR; Promoter; Poly A….
Exon-Intron-Exon structure of gene
Intron states model 3 scenarios:
I0: Intron is between two codons
I1: Intron is right after first codon base
I2: Intron is right after second codon base
Exons model respective scenarios
![Page 9: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/9.jpg)
9
17
GeneScan Strategy
Ex5Intergenic Promoter Ex5Ex2 In4Ex2 Ex3 Ex4 Poly AIn3In2In1Ex1
5’ UTR 3’ UTR
Singleexongene
5’ UTR
Promoter
Intergenicregion
E1E0 E2
Einit Eterm
3’ UTR
I0 I1 I2
Poly A signal
HMM: sequence generator; state=region
Decoding: parse sequence into regions
Viterbi: maximum likelihood decoder via DP
These structures admit generalizations
Forward Strand
Forward Strand
18
Extending The HMM Model
Problem: the HMM does not model natureSequence lengths are not geometrically distributedSome regions have special features (e.g., splice sites use GT…)…….
Challenge: generalize HMM while retaining algorithms
Ex5Intergenic Promoter Ex5Ex2 In4Ex2 Ex3 Ex4 Poly AIn3In2In1Ex1
….ACAGTTATAGCGTAATTGCGAGCTATCATGAG……
Decode
![Page 10: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/10.jpg)
10
19
GeneScan: Extended Sequence Generator Model
Add to state k a length distribution fk(L)Semi-Markov modelExtend the Viterbi DP recursion to reflect this
Incorporate special regions structuresWeight matrix model for splice site
…. Preserve the recursion structure Single
exongene
5’ UTR
Promoter
Intergenicregion
E1E0 E2
Einit Eterm
3’ UTR
I0 I1 I2
Poly A signal
20
Measuring Classifier Performance The challenge: how do we evaluate performance of a classifier?
E.g., consider a test to decide whether a person is sick (p) or healthy (n) Measured Performance
TP=True Positive; TN=True Negative; FP=False Positive; FN = False Negative
Sensitivity: SN=TP/(TP+FN) Sensitivity =percentage of correct predictions when the actual state is positive
Specificity: SP =TN/(TN+FP) Specificity=percentage of correct predictions when the actual state is negative
Receiver Operating Characteristics (ROC) curves Represent tradeoff between sensitivity & specificity
TP
TN
FN
FP
P
N
P’ N’prediction
Actual
Sensitivity
1-specificity
10.5
0.5 1.0
ROC curves
![Page 11: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/11.jpg)
11
21
Performance Comparisons
22
IdentifyingTransmembrane Proteins
![Page 12: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/12.jpg)
12
23
Transmembrane Proteins Transmembrane proteins are the key tools for cells interactions
Signaling, transport, adhesion…
A common structural motif: core helices Example: receptors
Function: receive signals, activate pathways Operations:
1. Signal recognition2. Signal transduction3. Pathway activation4. Down regulation
www.pdb.org
24
Example: The Insulin Pathway
From Wikipedia
Insulin binds toInsR receptor
Receptor activatespathway
Glut-4 is transported tomembrane to generate
glucose influx
Metabolic networkprocessing
InsR phosphorylated complex
![Page 13: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/13.jpg)
13
25
TMHMM [Sonhammer et al 98]
A hidden Markov Model for Predicting Transmembrane Helices in ProteinIn J. Glasgow et al., eds., Proc. Sixth Int. Conf. on Intelligent Systems forMolecular Biology, 175-182. AAAI Press, 1998. 1
A HMM to identifytransmembrane proteins
Architecture: The helix core is key Globular and loop domains Modeling challenge: variable length
Training The Maximization step of Baum
Welch uses simulated annealing 3-stage training
26
TMHMM Performance
![Page 14: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/14.jpg)
14
27
ProteinsStructure Prediction
28
Proteins Are Formed From Peptide Bonds
Ramachandran Regions
![Page 15: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/15.jpg)
15
29
Levels of Protein Structure
30
Structure Formation
![Page 16: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/16.jpg)
16
31
Helix is Formed Through H Bonds
32
Tertiary Structures: b-Barrel Example
![Page 17: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/17.jpg)
17
33
Why Is Structure Important?Protein functions are handled by structural elements
HIV protease drug binding siteEnzyme active site
Antibody-protein interfaces
34
Structure DeterminationMap: sequence structure
Structure determination is handled through crystalographyX-Ray; NMR
PDB: a database of structures
![Page 18: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/18.jpg)
18
35
The Problem Of Structure Prediction Derive conformation geometry from sequence
Ab-initio techniquesHomology techniques
Secondary structure is organized in folds α-helix, β-sheets….
36
The Structure Prediction Problem
Given: an amino acid sequenceOutput: structure annotation {H,B…}
![Page 19: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/19.jpg)
19
37
HMM Model Hidden states: folds Observable: AA sequence Viterbi decoding: find most likely annotation Training: supervised learning use known structures Performance: HMM works well for limited protein types
E.g., TMHMM… For general structure predictions Neural-Nets are superior
(e.g., PHD, R. Burkhard 93)Why?
o [HMM works as long as the underlying conformation is consistent with the Markovian model.Protein conformations may depend on complex long-range non-Markovian interactions. E.g.,consider the impact of a single AA change on hemoglobin conformation in sickle-cell anemia. ]
38
Example (Chu et al, 2004)Step 1: align sequences to partition into regions
Step 2: build extended HMM (semi-markov length)
www.gatsby.ucl.ac.uk/~chuwei/paper/icml04_talk.pdf
![Page 20: Chapter 4: Hidden Markov Models - Columbia University · Chapter 4: Hidden Markov Models 4.4 HMM Applications 2 Overview ... A hidden Markov Model for Predicting Transmembrane Helices](https://reader034.fdocuments.in/reader034/viewer/2022050918/5c882c9809d3f24c348d1db5/html5/thumbnails/20.jpg)
20
39
Performance
Challenges long-range interactions (e.g., β-sheets)
40
Conclusions HMM are very useful when a Markovian model is appropriate
Solve three central problems: Decoding sequences to classify their components Computing likelihood of the classification Training
May be extended to handle non-Markovian elements
But can lose their predictive power when Markovian assumptions
diverge from nature