Introduction to Probabilistic Sequence Models: Theory and Applications
![Page 1: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/1.jpg)
Introduction to Probabilistic Sequence Models:
Theory and Applications
David H. Ardell, Forskarassistent
![Page 2: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/2.jpg)
Lecture Outline: Intro. to Probabilistic Sequence Models
Motif Representations: Consensus Sequences, Motifs and Blocks, Regular Expressions
Probabilistic Sequence Models: profiles, HMMs, SCFG
![Page 3: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/3.jpg)
Consensus sequences revisited
Consensus sequences make poor summaries
A T C G
![Page 4: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/4.jpg)
A motif is a short stretch of protein sequence associated with a particular function (R. Doolittle, 1981)
The first-described and most prominent example is the P-loop, which binds phosphate in ATP/GTP-binding proteins
[GA]x(4)GK[ST]
A variety of databases of such motifs exist, such as BLOCKS, PROSITE, and PRINTS, and there are many tools to search proteins for matches to these motifs.
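A PROSITE-style pattern such as the P-loop motif above maps almost directly onto a regular expression. The sketch below is a minimal illustration, not a complete PROSITE parser: the helper name `prosite_to_regex` and the test peptide are hypothetical, and only the syntax used in this pattern (`x`, repeat counts in parentheses, `[..]` classes) plus optional `-` separators is handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into Python regex syntax.

    Handles only: x (any residue), (n) and (n,m) repeat counts,
    [..] character classes, and optional '-' separators.
    """
    regex = pattern.replace("-", "")   # separators carry no meaning
    regex = regex.replace("x", ".")    # x = any residue
    # (4) -> {4}, (2,3) -> {2,3}
    regex = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", regex)
    return regex

p_loop = prosite_to_regex("[GA]x(4)GK[ST]")
print(p_loop)  # [GA].{4}GK[ST]

# Hypothetical peptide containing the motif, for illustration only.
peptide = "MKLGPSGSGKSTLLR"
match = re.search(p_loop, peptide)
print(match.group(0) if match else None)  # GPSGSGKS
```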
![Page 5: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/5.jpg)
Introduction to Regular Expressions (Regexes)
Regular Expressions specify sets of sequences that match a pattern.
Ex: a[bc]a matches "aba" and "aca"
In addition to literals like a and b in the last example, regular expressions provide quantifiers like * (0 or more), + (1 or more), ? (0 or 1) and {N,M} (between N and M):
Ex: a[bc]*a matches "aa", "aba", "acca", "acbcba" etc
They also provide grouping constructions: character classes like [xy], groups that can be quantified like (this)+, and alternation with |, which means "or" in (this|that)
Anchors match the beginning ^ and end $ of strings
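The constructs listed above can be checked with any regex engine. Below is a small sketch using Python's standard `re` module; `re.fullmatch` is used so that the whole string has to match, and the test strings beyond those quoted on the slide are made up for illustration.

```python
import re

# Character class: exactly one of b or c between two a's.
assert re.fullmatch(r"a[bc]a", "aba")
assert re.fullmatch(r"a[bc]a", "aca")
assert not re.fullmatch(r"a[bc]a", "ada")

# Quantifier *: zero or more of b or c between the a's.
for s in ["aa", "aba", "acca", "acbcba"]:
    assert re.fullmatch(r"a[bc]*a", s)

# + needs at least one, ? allows at most one, {N,M} bounds the count.
assert not re.fullmatch(r"a[bc]+a", "aa")
assert re.fullmatch(r"a[bc]?a", "aa")
assert re.fullmatch(r"a[bc]{2,3}a", "abca")
assert not re.fullmatch(r"a[bc]{2,3}a", "abcbca")

# Grouping and alternation.
assert re.fullmatch(r"(this)+", "thisthis")
assert re.fullmatch(r"(this|that)", "that")

# Anchors: ^ and $ tie the pattern to the start and end of the string.
assert re.search(r"^ab", "abc")
assert not re.search(r"^bc", "abc")
assert re.search(r"bc$", "abc")
print("all regex examples behave as described")
```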
![Page 6: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/6.jpg)
IUPAC DNA ambiguity codes as regex classes
Pyrimidines Y = [CT]
PuRines R = [AG]
Strong S = [CG]
Weak W = [AT]
Keto K = [GT]
aMino M = [AC]
Not A B = [CGT] (B is the letter after A)
Not C D = [AGT]
Not G H = [ACT]
Not T V = [ACG]
Any base N = [ACGT]
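Because each ambiguity code is just a character class, a motif written in IUPAC letters can be expanded mechanically into a regex. The following is a minimal sketch; the function name `iupac_to_regex` and the test sequence are hypothetical and serve only as illustration.

```python
import re

# IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]", "R": "[AG]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(motif: str) -> str:
    """Expand each IUPAC letter of a DNA motif into its regex class."""
    return "".join(IUPAC[base] for base in motif.upper())

pattern = iupac_to_regex("TRWWAT")
print(pattern)  # T[AG][AT][AT]AT

# Made-up sequence, for illustration only.
hits = [m.start() for m in re.finditer(pattern, "GGTATAATGCTGTTAAATCC")]
print(hits)  # [2]
```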
![Page 7: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/7.jpg)
Regular Expressions are like machines that eat sequences one letter at a time
[Figure, animated over slides 7-18: a state machine for a[bc]+a with states Begin, a, [bc], a, End and transitions labelled [bc], [^bc], and [^a].]
Ex: a[bc]+a matching "ghstuacbah"
The machine reads the input one letter at a time: g, h, s, t, u leave it at Begin; a, c, b, a then carry it through the a, [bc], [bc], and a states to End.
MATCH!
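The machine picture can be made concrete with a hand-written state machine. The following is a minimal sketch and not how production regex engines are implemented; the state names and the function `find_match` are made up for this example, and the restart rules on a mismatch are specific to this one pattern.

```python
def find_match(text: str):
    """Scan text one letter at a time for the pattern a[bc]+a.

    Returns (start, end) of the first match, or None.
    """
    BEGIN, SEEN_A, IN_BC = "Begin", "a", "[bc]"
    state, start = BEGIN, None
    for i, letter in enumerate(text):
        if state == BEGIN:
            if letter == "a":
                state, start = SEEN_A, i
        elif state == SEEN_A:
            if letter in "bc":
                state = IN_BC
            elif letter == "a":
                start = i            # a new candidate match starts here
            else:
                state = BEGIN
        elif state == IN_BC:
            if letter == "a":
                return start, i + 1  # reached End: MATCH!
            if letter not in "bc":
                state = BEGIN
    return None

text = "ghstuacbah"
span = find_match(text)
print(span, text[span[0]:span[1]])  # (5, 9) acba
```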
![Page 19: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/19.jpg)
Motifs are almost always either too selective or not specific enough
The first-described and most prominent example is the P-loop, which binds phosphate in ATP/GTP-binding proteins
[GA]x(4)GK[ST]
Prob. of this motif at a given position ≈ (1/10)(1/20)(1/20)(1/10) = 0.000025 (assuming 20 equally frequent amino acids)
Expected number of matches in a database with 3.2 × 10^8 residues: about 8000!
About half of the proteins that match this motif are not NTPases of the P-loop class. (Lack of specificity)
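The arithmetic behind these two numbers can be reproduced in a few lines. This is a sketch under the same simplifying assumptions as the slide (20 equally frequent amino acids, independent positions); the variable names are made up, and the small difference between the number of residues and the number of possible start positions is ignored.

```python
from fractions import Fraction

# Assumption: all 20 amino acids are equally frequent and positions are independent.
ALPHABET_SIZE = 20

# Number of allowed residues at each constrained position of [GA]x(4)GK[ST];
# the four x positions accept anything and contribute a factor of 1.
allowed = [2, 1, 1, 2]           # [GA], G, K, [ST]

p_motif = Fraction(1)
for n in allowed:
    p_motif *= Fraction(n, ALPHABET_SIZE)

database_residues = 320_000_000  # 3.2 x 10^8, as quoted on the slide
expected = p_motif * database_residues

print(float(p_motif))   # 2.5e-05
print(float(expected))  # 8000.0
```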
![Page 20: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/20.jpg)
Motifs are almost always either too selective or not specific enough
[GA]x(4)GK[ST]
Larger and larger alignments of true members of the class give more and more exceptions to the rule (lack of sensitivity)
Extending the rule ([GAT]x(4)[GAF][KTL][STG]) leads to loss of specificity
![Page 21: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/21.jpg)
A better way to model motifs
REGULAR EXPRESSIONS “(TTR[ATC]WT) N{15,22} (TRWWAT)”
Can find alternative members of a class
Treat alternative character states as equally likely.
Treat all spacer lengths as equally likely.
PROFILES (Position-Specific Score Matrices)
![Page 22: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/22.jpg)
Profiles turn alignments into probabilistic models
![Page 23: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/23.jpg)
A graphical view of the same profile:
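CCGTL…
CGHSV…
GCGSL…
CGGTL…
CCGSS…
[Figure: the same profile drawn as a chain of positions, each labelled with residues from the corresponding alignment column.]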
![Page 24: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/24.jpg)
You can also allow for unobserved residues or bases in a profile by giving them small probabilities:
[Figure: a DNA profile diagram in which positions also carry bases (G, A, T, C) that were not observed in the alignment.]
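Turning an alignment into a profile amounts to counting residues column by column, and pseudocounts are one common way to give unobserved residues a small probability. The sketch below uses the five alignment rows shown earlier; the function name `build_profile` and the pseudocount value of 0.1 are assumptions, since the slides give neither.

```python
from collections import Counter

def build_profile(alignment, alphabet, pseudocount=0.0):
    """Column-wise residue frequencies of an ungapped alignment.

    A pseudocount > 0 gives unobserved residues a small probability.
    """
    n_rows = len(alignment)
    total = n_rows + pseudocount * len(alphabet)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({r: (counts[r] + pseudocount) / total for r in alphabet})
    return profile

# First five columns of the alignment shown on the slide.
alignment = ["CCGTL", "CGHSV", "GCGSL", "CGGTL", "CCGSS"]
alphabet = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

plain = build_profile(alignment, alphabet)
print(plain[0]["C"], plain[0]["G"], plain[0]["A"])   # 0.8 0.2 0.0

# Assumed pseudocount of 0.1 per residue; the slide gives no value.
smoothed = build_profile(alignment, alphabet, pseudocount=0.1)
print(round(smoothed[0]["A"], 4))                    # 0.0143
```

With the pseudocount, every column still sums to 1, but no residue has probability zero, so a single unexpected residue no longer forces the whole product to zero.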
![Page 25: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/25.jpg)
The probability that a sequence matches a profile P is the product of its parts:
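[Figure: profile P, with emission probabilities labelled G 0.1, A 0.7, T 0.1, G 0.8, C 0.7, T 0.2, A 0.8, G 0.2, C 0.2, T 0.6, A 0.1]
Ex: p(AAGCT | P) = p(A) × p(A) × p(G) × p(C) × p(T) = 0.8 × 0.7 × 0.8 × 0.7 × 0.6 ≈ 0.188 (rounded down to 0.18 in what follows)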
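The product rule is easy to express in code. In this sketch the profile holds only the five probabilities used in the slide's example, because the transcript does not allow the remaining values of P to be assigned to positions with confidence; the function name `sequence_probability` is made up.

```python
import math

# Emission probabilities of profile P for the residues of AAGCT, one dict per
# position. Only the entries used in the slide's example are filled in.
profile = [{"A": 0.8}, {"A": 0.7}, {"G": 0.8}, {"C": 0.7}, {"T": 0.6}]

def sequence_probability(seq, profile):
    """Product of the position-specific probabilities of each residue."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return math.prod(column[residue] for residue, column in zip(seq, profile))

p = sequence_probability("AAGCT", profile)
print(round(p, 3))  # 0.188
```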
![Page 26: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/26.jpg)
In practice, we compare this probability to that of matching a null model
[Figure: profile P drawn beside a null model in which every position can emit G, A, C, or T.]
![Page 27: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/27.jpg)
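The null model is usually based on overall sequence composition.
Null model, at every position: G 0.25, A 0.25, C 0.25, T 0.25
[Figure: the null model shown beside profile P (G 0.1, A 0.7, T 0.1, G 0.8, C 0.7, T 0.2, A 0.8, G 0.2, C 0.2, T 0.6, A 0.1)]
No positional information need be taken into account.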
![Page 28: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/28.jpg)
Example: probabilities of AAGCT with the two models
Null model, at every position: G 0.25, A 0.25, C 0.25, T 0.25
[Figure: the null model shown beside profile P, as on the previous slide]
Profile P: p = 0.18
Null model: p = 0.25^5 = 0.00098
![Page 29: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/29.jpg)
Example: odds ratio of AAGCT with the two models
[Figure: the null model shown beside profile P, as on the previous slides]
Profile P: p = 0.18
Null model: p = 0.25^5 = 0.00098
The odds ratio is 0.18 / 0.00098 ≈ 184: AAGCT is 184 times more probable under the profile than under the null model!
![Page 30: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/30.jpg)
As with substitution scoring matrices, we prefer the log-odds as a profile score
$$\log_2 \frac{\Pr(\mathrm{AAGCT} \mid P)}{\Pr(\mathrm{AAGCT} \mid \mathrm{null})} = \log_2\left(\frac{0.18}{0.00098}\right) = \log_2(184) \approx 7.5$$
A positive log-odds (score) indicates a match.
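Because the logarithm of a product is a sum, the log-odds score can be accumulated position by position. The sketch below does this for AAGCT; the function name `log_odds` is made up. Computed without intermediate rounding, the score comes out slightly higher than the 7.5 above, which is based on the rounded values 0.18 and 0.00098.

```python
import math

def log_odds(seq, match_probs, background):
    """log2 of P(seq | profile) / P(seq | null), summed position by position.

    match_probs[i] is the profile probability of seq[i] at position i;
    background maps each base to its null-model probability.
    """
    return sum(
        math.log2(p_match / background[base])
        for base, p_match in zip(seq, match_probs)
    )

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
match_probs = [0.8, 0.7, 0.8, 0.7, 0.6]   # profile values used for AAGCT

score = log_odds("AAGCT", match_probs, background)
print(round(score, 2))     # 7.59
print(round(2 ** score))   # 193
```

Storing the per-position log-odds terms rather than raw probabilities is what turns a profile into a position-specific score matrix: scoring a sequence becomes a table lookup and a sum.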
![Page 31: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/31.jpg)
Digression: interpreting BLAST results
The bit score is a scaled log-odds of homology versus chance
![Page 32: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/32.jpg)
Digression: interpreting BLAST results
The E-value is the expected number of chance hits with scores of at least S
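The two BLAST quantities are linked by the standard Karlin-Altschul relations: a raw score S is converted to a bit score S' = (λS − ln K) / ln 2, and the E-value for a search space of size m × n is m · n · 2^(−S'). The sketch below encodes only these two formulas; the function names are made up, and the values of λ, K, and the sequence lengths are illustrative assumptions, since they depend on the scoring system and the database.

```python
import math

def bit_score(raw_score, lam, K):
    """Normalize a raw alignment score S into bits: (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value(bits, query_len, db_len):
    """Expected number of chance hits scoring at least this many bits."""
    return query_len * db_len * 2.0 ** (-bits)

# Illustrative values only: lam and K depend on the scoring system, and the
# lengths below are made up.
lam, K = 0.267, 0.041
bits = bit_score(100, lam, K)
print(round(bits, 1))
print(e_value(bits, query_len=250, db_len=320_000_000))
```

Each additional bit halves the E-value, which is why bit scores can be compared across searches while raw scores cannot.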
![Page 33: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/33.jpg)
A better way to model motifs
REGULAR EXPRESSIONS “(TTR[ATC]WT) N{15,22} (TRWWAT)”
Can find alternative members of a class
Treat alternative character states as equally likely.
Treat all spacer lengths as equally likely.
PROFILES (Position-Specific Score Matrices)
Turn a multiple sequence alignment into a multidimensional (by position) multinomial distribution.
Explicit accounting of observed character states
Cannot handle gaps (separate models must be made for different spacer lengths -- O’Neill and Chiafari 1989)
Can't be used to make alignments
![Page 34: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/34.jpg)
Hidden Markov Models
A Hidden Markov Model is a machine that can either parse or emit a family of sequences according to a Markov model
The same symbol can be emitted from different states (A, C, T, or G can occur in a promoter, a codon, a terminator, etc.), so the state cannot be read directly off the sequence; we say the states are “hidden”
Example: The Dice Factory
FAIR die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
BIASED die: P(1) = 3/6; P(2) = P(3) = P(4) = P(5) = P(6) = 1/10
Transitions: FAIR → FAIR 0.99, FAIR → BIASED 0.01, BIASED → BIASED 0.70, BIASED → FAIR 0.30
Rolls: ...11452161621233453261432152211121611112211...
[Figure: GENERATED and PREDICTED state tracks shown along the rolls]
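Predicting which die produced each roll is a decoding problem, and the usual tool for it is the Viterbi algorithm, which finds the single most probable state path. The sketch below is a minimal log-space version for the dice factory. The transition table reflects one reading of the slide's arrows (each die tends to stay in use), the uniform start probabilities are an assumption, and the names are made up for this example.

```python
import math

STATES = ("F", "B")   # fair, biased
# Assumed reading of the slide's arrows: each die tends to stay in use.
TRANS = {"F": {"F": 0.99, "B": 0.01}, "B": {"F": 0.30, "B": 0.70}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "B": {1: 3 / 6, **{r: 1 / 10 for r in range(2, 7)}},
}
START = {"F": 0.5, "B": 0.5}   # assumption: the slide gives no start probabilities

def viterbi(rolls):
    """Most probable state path for a list of die rolls, computed in log space."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda q: score[q] + math.log(TRANS[q][s]))
            pointer[s] = best_prev
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][roll]))
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))

rolls = [int(c) for c in "11452161621233453261432152211121611112211"]
print("".join(map(str, rolls)))
print(viterbi(rolls))
```

Printing the rolls above the decoded path gives a text version of a roll track with a predicted state track beneath it.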
![Page 35: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/35.jpg)
A Profile HMM is a profile with gaps
[Figure, built up over slides 35-38: a profile drawn as a chain of positions emitting G, A, T, and C; insertions are added to the diagram, then deletions, then both together.]
![Page 39: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/39.jpg)
The HMMer Null Model (composition of insertions may be set by the user, e.g. to match the genome)
G 0.25, A 0.25, C 0.25, T 0.25
![Page 40: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/40.jpg)
The Plan 7 architecture in HMMer
Permit local matches to sequence
Permit repeated matches to sequence
Permit local matches to model
![Page 41: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/41.jpg)
HMMer2 (pronounced 'hammer', as in, “Why BLAST if you can hammer?”)
![Page 42: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/42.jpg)
The HMMer2 design separates models from algorithms
With the same alignment or model design, you can easily change the search algorithm (encoded in the HMM) to do:
Multihit Global alignments of model to sequence
Multihit Smith-Waterman (local with respect to both model and sequence, multiple non-overlapping hits to sequence allowed)
Single (best) hit variants of both of the above.
![Page 43: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/43.jpg)
This separation of model from algorithm provides a ready framework for sequence analysis (programs provided in HMMer)
hmmalign Align sequences to an existing model.
hmmbuild Build a model from a multiple sequence alignment.
hmmcalibrate Takes an HMM and empirically determines parameters that are used to make searches more sensitive, by calculating more accurate expectation value scores (E-values).
hmmconvert Convert a model file into different formats, including a compact HMMER 2 binary format, and “best effort” emulation of GCG profiles.
hmmemit Emit sequences probabilistically from a profile HMM.
hmmfetch Get a single model from an HMM database.
hmmindex Index an HMM database.
hmmpfam Search an HMM database for matches to a query sequence.
hmmsearch Search a sequence database for matches to an HMM.
![Page 44: Introduction to Probabilistic Sequence Models: Theory and Applications](https://reader035.fdocuments.in/reader035/viewer/2022062409/568151c8550346895dbffda7/html5/thumbnails/44.jpg)
HMMer2 format can be automatically converted for use with SAM