Profile HMMs for sequence families and Viterbi equations Linda Muselaars and Miranda Stobbe.
Example alignment
HBA_HUMAN   -HGSAQVKGHGKKVADALTNAVAHV-
HBB_HUMAN   VMGNPKVKAHGKKVLGAFSDGLAHL-
MYG_PHYCA   MKASEDLKKHGVTVLTALGAILKK--
GLB3_CHITP  IKGTAPFETHANRIVGFFSKIIGEL-
GLB5_PETMA  LKKSADVRWHAERIINAVNDAVASM-
LGB2_LUPLU  PQNNPELQAHAGKVFKLVYEAAIQLQ
GLB1_GLYDI  ---DPGVAALGAKVLAQIGVAVSHL-
Overview chapter 5
Ungapped score matrices.
Adding insert and delete states to obtain profile HMMs.
Deriving profile HMMs from multiple alignments.
Searching with profile HMMs.
Profile HMM variants for non-global alignments.
More on estimation of probabilities.
Optimal model construction.
Weighting training sequences.
Key-issues
Identifying the relationship of an individual sequence to a sequence family.
How to build a profile HMM.
Use profile HMMs to detect potential membership in a family.
Use profile HMMs to give an alignment of a sequence to the family.
Key-issues (2)
Lollypops for a valuable (up to the speakers to decide) contribution to this lecture.
Needed theory
Emission probabilities.
Silent states.
Pair HMMs.
The Viterbi algorithm.
The Forward algorithm.
Contents
Ungapped score matrices.
Adding insert and delete states to obtain profile HMMs.
Deriving profile HMMs from multiple alignments.
– Non-probabilistic profiles
– Basic profile HMM parameterisation
Searching with profile HMMs.
Profile HMM variants for non-global alignments.
Example alignment
HBA_HUMAN   -HGSAQVKGHGKKVADALTNAVAHV-
HBB_HUMAN   VMGNPKVKAHGKKVLGAFSDGLAHL-
MYG_PHYCA   MKASEDLKKHGVTVLTALGAILKK--
GLB3_CHITP  IKGTAPFETHANRIVGFFSKIIGEL-
GLB5_PETMA  LKKSADVRWHAERIINAVNDAVASM-
LGB2_LUPLU  PQNNPELQAHAGKVFKLVYEAAIQLQ
GLB1_GLYDI  ---DPGVAALGAKVLAQIGVAVSHL-
               *********************
Ungapped regions
Gaps tend to line up.
We can consider models for ungapped regions.
Specify independent probabilities e_i(a):
$$P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$$
But of course: log-odds ratio!
Position specific score matrix.
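
The step from per-column probabilities to a position specific score matrix can be sketched as follows. This is only an illustration: the DNA alphabet, the uniform background q, the add-one pseudocount and the names `build_pssm` and `score` are assumptions made for the sketch and do not come from the slides.

```python
import math

ALPHABET = "ACGT"  # toy alphabet, assumed for this sketch


def build_pssm(columns, background, pseudocount=1.0):
    """Return one dict per column holding log(e_i(a) / q_a) for each symbol a."""
    pssm = []
    for col in columns:
        total = len(col) + pseudocount * len(ALPHABET)
        scores = {}
        for a in ALPHABET:
            e = (col.count(a) + pseudocount) / total  # smoothed e_i(a)
            scores[a] = math.log(e / background[a])
        pssm.append(scores)
    return pssm


def score(pssm, x):
    """Log-odds score of an ungapped sequence x: sum over i of log(e_i(x_i) / q_{x_i})."""
    return sum(position[a] for position, a in zip(pssm, x))


# Hypothetical ungapped training block of three sequences.
seqs = ["ACGT", "ACGA", "ACCT"]
columns = ["".join(s[i] for s in seqs) for i in range(len(seqs[0]))]
q = {a: 0.25 for a in ALPHABET}
pssm = build_pssm(columns, q)
```

Summing log-odds terms corresponds to taking the logarithm of the product P(x | M) divided by the background probability of x, which is why a sequence resembling the training block (`score(pssm, "ACGT")`) scores higher than an unrelated one (`score(pssm, "TTTT")`).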
Drawbacks
Multiple alignments do have gaps.
Need to be accounted for.
For example: BLOCKS database, with combined scores of ungapped regions.
We will develop a single probabilistic model for the whole extent of the alignment.
Short review
Emission probabilities: the probability that a certain symbol is seen when in a certain state k.
Silent states: states that do not emit symbols in an HMM.
Building the model (1)
We need position-sensitive gap scores.
HMM with a repetitive structure of (match) states.
Transitions of probability 1.
Emission probabilities: e_{M_i}(a).
[Figure: linear chain of states Begin → ... → M_j → ... → End.]
Building the model (2)
Deal with insertions: set of new states I_i.
I_i have emission distribution e_{I_i}(a).
Set to the background distribution q_a.
[Figure: the chain Begin, M_j, End extended with an insert state I_j.]
Building the model (3)
Deal with deletions.
Possibly forward jumps.
For arbitrarily long gaps: silent states D_j.
[Figure: the chain Begin, M_j, End extended with a delete state D_j.]
Costs for additional states
States for insertions: the sum of the costs of the transitions and emissions (M→I, a number of I→I, I→M).
States for deletions: the sum of the costs of an M→D transition, a number of D→D transitions and a D→M transition.
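
Written out in log-odds terms, and assuming (as on the insert-state slide) that insert emissions equal the background distribution so that they contribute zero, an insertion of k residues after match state M_j would cost

$$\log a_{M_j I_j} + (k-1)\log a_{I_j I_j} + \log a_{I_j M_{j+1}},$$

and a deletion that skips match states j+1 to j+k would cost

$$\log a_{M_j D_{j+1}} + \sum_{m=1}^{k-1} \log a_{D_{j+m} D_{j+m+1}} + \log a_{D_{j+k} M_{j+k+1}}.$$

The index conventions (I_j sits between M_j and M_{j+1}, D_j runs parallel to M_j) are the usual ones for profile HMMs and are assumed here; the slide does not spell them out. Because every D→D term can differ by position, deletions are scored position by position, whereas a long insertion repeats the same I→I term.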
Comparison with pair HMM
[Figure: pair HMM with states X (emitting q_{x_i}), M (emitting p_{x_i y_j}) and Y (emitting q_{y_j}) between Begin and End.]
Non-probabilistic profiles
Profile HMM without underlying probabilistic model.
Set scores to averages of standard substitution scores.
Anomalies:
– Conservation of columns is not taken into account.
– Scores for gaps do not behave properly.
Example
HBA_HUMAN   ...VGA--HAGEY...
HBB_HUMAN   ...V----NVDEV...
MYG_PHYCA   ...VEA--DVAGH...
GLB3_CHITP  ...VKG------D...
GLB5_PETMA  ...VYS--TYETS...
LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...
               ***  *****
The score for residue a in column 1 would be set to:
$$\frac{5}{7}\,s(\mathrm{V},a) + \frac{1}{7}\,s(\mathrm{F},a) + \frac{1}{7}\,s(\mathrm{I},a)$$
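
The averaging rule for column 1 can be sketched as below. The weights are computed from the column shown in the example; the substitution scores in `toy_s` are placeholder numbers chosen for illustration, not values quoted from the slides, and the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def column_weights(column):
    """Relative frequency of each residue among the non-gap characters of a column."""
    residues = [c for c in column if c != "-"]
    counts = Counter(residues)
    return {r: Fraction(n, len(residues)) for r, n in counts.items()}


def average_score(column, a, s):
    """Frequency-weighted average of the substitution scores s[(r, a)]."""
    return sum(w * s[(r, a)] for r, w in column_weights(column).items())


column_1 = "VVVVVFI"  # first starred column of the example alignment
weights = column_weights(column_1)

# Placeholder substitution scores for residue a = "V" (assumed values).
toy_s = {("V", "V"): 4, ("F", "V"): -1, ("I", "V"): 3}
score_v = average_score(column_1, "V", toy_s)
```

With these placeholder scores the result is 5/7 * 4 + 1/7 * (-1) + 1/7 * 3 = 22/7. The same average would be obtained from a column with 50 V, 10 F and 10 I, which illustrates the first anomaly: the amount of evidence behind a column is not taken into account.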
Basic profile HMM parameterisation
Objective: make the probability distribution peak around members of the family.
Available parameters:
– Length of the model.
– Transition and emission probabilities.
Length of the model
Which multiple alignment columns do we assign to match states?
And which to insert states?
Heuristic rule: columns that consist of more than 50% gap characters should be modeled by insert states.
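
The heuristic rule is simple enough to sketch directly. The function name, the `max_gap_fraction` parameter and the toy alignment are assumptions for illustration; only the 50% threshold comes from the slide.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Return one boolean per column: True for a match column, False for an insert column.

    A column becomes an insert column only when MORE than max_gap_fraction
    of its characters are gaps, so a column with exactly 50% gaps stays a match column.
    """
    n_rows = len(alignment)
    marks = []
    for column in zip(*alignment):
        gap_fraction = column.count("-") / n_rows
        marks.append(gap_fraction <= max_gap_fraction)
    return marks


# Hypothetical four-sequence alignment.
toy_alignment = ["AC-GT", "A--GT", "ACTG-", "A--GT"]
marks = match_columns(toy_alignment)
model_length = sum(marks)  # number of match states
```

Here the third column has three gaps out of four and is assigned to an insert state, while the second column has exactly half gaps and remains a match column, giving a model of length 4.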
Probability parameters
Transition probability:
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$$
where A_{kl} is the number of transitions from state k to state l, and the denominator is the number of transitions from state k to any other state.
Emission probability:
$$e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$$
In the limit this is an accurate and consistent estimation.
Pseudocount method: Laplace's rule.
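
Laplace's rule adds one to every count before normalising, so the estimates above become

$$a_{kl} = \frac{A_{kl} + 1}{\sum_{l'} \left(A_{kl'} + 1\right)}, \qquad e_k(a) = \frac{E_k(a) + 1}{\sum_{a'} \left(E_k(a') + 1\right)}.$$

As a small worked case, assuming a four-letter alphabet and a match column in which A was observed four times and no other letter was seen, the unsmoothed estimate gives e_k(A) = 1 and zero for the rest, whereas Laplace's rule gives e_k(A) = (4+1)/(4+4) = 5/8 and 1/8 for each other letter. No symbol or transition that is absent from the training alignment ends up with probability zero.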
Example
Bat   A G - - - C
Rat   A - A G - C
Cat   A G - A A -
Gnat  - - A A A C
Goat  A G - - - C
      * *   *   *
Example continued
[Figure: profile HMM with states Begin, M1 to M4, I0 to I4, D1 to D4 and End.]
Match-state emission probabilities:
     M1   M2   M3   M4
A    5/8  1/7  3/7  1/8
C    1/8  1/7  1/7  5/8
G    1/8  4/7  2/7  1/8
T    1/8  1/7  1/7  1/8
Transition probabilities out of M1:
a_{M1M2} = 4/7
a_{M1D2} = 2/7
a_{M1I1} = 1/7
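
The numbers on this slide can be reproduced by tracing each aligned sequence through the model and applying add-one pseudocounts. The sketch below does this for the transitions out of M1 and the emissions of M1; the state naming ("B", "M1", "D2", "I2", "E") and the helper names are assumptions of the sketch, and the match columns are taken to be the four starred columns (1, 2, 4 and 6) of the example.

```python
from collections import Counter
from fractions import Fraction

ALIGNMENT = {
    "Bat":  "AG---C",
    "Rat":  "A-AG-C",
    "Cat":  "AG-AA-",
    "Gnat": "--AAAC",
    "Goat": "AG---C",
}
MATCH_COLS = [0, 1, 3, 5]  # zero-based indices of the starred columns
ALPHABET = "ACGT"


def state_path(row):
    """Map one aligned row to its state path, e.g. ['B', 'M1', 'D2', 'I2', ...]."""
    path, j = ["B"], 0
    for col, ch in enumerate(row):
        if col in MATCH_COLS:
            j += 1
            path.append(("M" if ch != "-" else "D") + str(j))
        elif ch != "-":
            path.append("I" + str(j))  # residue in an insert column
    path.append("E")
    return path


def laplace(counts, outcomes):
    """Add-one estimate over the listed outcomes."""
    total = sum(counts[o] for o in outcomes) + len(outcomes)
    return {o: Fraction(counts[o] + 1, total) for o in outcomes}


transitions = Counter()
for row in ALIGNMENT.values():
    p = state_path(row)
    transitions.update(zip(p, p[1:]))

from_m1 = Counter({t: n for (s, t), n in transitions.items() if s == "M1"})
a_m1 = laplace(from_m1, ["M2", "D2", "I1"])

emissions_m1 = Counter(row[0] for row in ALIGNMENT.values() if row[0] != "-")
e_m1 = laplace(emissions_m1, ALPHABET)
```

Four sequences pass through M1: three continue to M2 and one (Rat) goes to D2, so the smoothed estimates are (3+1)/7, (1+1)/7 and (0+1)/7, matching the transition values on the slide.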
Searching with profile HMMs
Obtaining significant matches of a sequence to the profile HMM:
– Viterbi algorithm: P(x, π* | M).
– Forward algorithm: P(x | M).
Give an alignment of a sequence to the family.
– Highest scoring, or Viterbi, alignment.
Viterbi equations
V^M_j(i): log-odds score of the best path matching subsequence x_{1...i} to the submodel up to state j, ending with x_i being emitted by state M_j.
V^I_j(i): log-odds score of the best path ending in x_i being emitted by I_j.
V^D_j(i): log-odds score of the best path ending in state D_j.
Pair HMM:
$$v^M(i,j) = p_{x_i y_j} \max \begin{cases} (1-2\delta-\tau)\, v^M(i-1,j-1), \\ (1-\varepsilon-\tau)\, v^X(i-1,j-1), \\ (1-\varepsilon-\tau)\, v^Y(i-1,j-1). \end{cases}$$
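Viterbi equations
$$V^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^M_{j-1}(i-1) + \log a_{M_{j-1}M_j}, \\ V^I_{j-1}(i-1) + \log a_{I_{j-1}M_j}, \\ V^D_{j-1}(i-1) + \log a_{D_{j-1}M_j}; \end{cases}$$
$$V^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^M_j(i-1) + \log a_{M_j I_j}, \\ V^I_j(i-1) + \log a_{I_j I_j}, \\ V^D_j(i-1) + \log a_{D_j I_j}; \end{cases}$$
$$V^D_j(i) = \max \begin{cases} V^M_{j-1}(i) + \log a_{M_{j-1}D_j}, \\ V^I_{j-1}(i) + \log a_{I_{j-1}D_j}, \\ V^D_{j-1}(i) + \log a_{D_{j-1}D_j}. \end{cases}$$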
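
The three recursions can be turned into a compact dynamic programme. The following is a minimal sketch under stated assumptions, not a reference implementation: transition probabilities are shared by all positions (a real profile HMM has position-specific ones, and the unused M→D and D→D transitions at the last node are simply ignored), insert states emit from the background so their emission term is zero, Begin is treated as M_0 and End as M_{L+1}. The match emissions are the ones from the Bat/Rat/Cat/Gnat/Goat example; the uniform background and the values in `A` are made up.

```python
import math

NEG_INF = float("-inf")


def viterbi_log_odds(x, e_match, q, a):
    """Best-path log-odds score of sequence x against a profile HMM.

    e_match[j-1][c] is e_{M_j}(c), q[c] is the background frequency and
    a["MI"], a["DM"], ... are transition probabilities keyed by state types.
    """
    L, n = len(e_match), len(x)
    la = {k: math.log(p) for k, p in a.items()}
    # V[s][j][i]: state type s, model position j, sequence position i.
    V = {s: [[NEG_INF] * (n + 1) for _ in range(L + 1)] for s in "MID"}
    V["M"][0][0] = 0.0  # Begin state plays the role of M_0
    for i in range(n + 1):
        for j in range(L + 1):
            if i > 0 and j > 0:  # M_j emits x_i
                emit = math.log(e_match[j - 1][x[i - 1]] / q[x[i - 1]])
                V["M"][j][i] = emit + max(
                    V[s][j - 1][i - 1] + la[s + "M"] for s in "MID")
            if i > 0:  # I_j emits x_i with log-odds 0
                V["I"][j][i] = max(V[s][j][i - 1] + la[s + "I"] for s in "MID")
            if j > 0:  # D_j is silent
                V["D"][j][i] = max(V[s][j - 1][i] + la[s + "D"] for s in "MID")
    # Termination: transition into End (M_{L+1}) without an emission term.
    return max(V[s][L][n] + la[s + "M"] for s in "MID")


E_MATCH = [
    {"A": 5 / 8, "C": 1 / 8, "G": 1 / 8, "T": 1 / 8},
    {"A": 1 / 7, "C": 1 / 7, "G": 4 / 7, "T": 1 / 7},
    {"A": 3 / 7, "C": 1 / 7, "G": 2 / 7, "T": 1 / 7},
    {"A": 1 / 8, "C": 5 / 8, "G": 1 / 8, "T": 1 / 8},
]
Q = {c: 0.25 for c in "ACGT"}  # assumed uniform background
A = {"MM": 0.8, "MI": 0.1, "MD": 0.1,  # assumed, position-independent
     "IM": 0.8, "II": 0.1, "ID": 0.1,
     "DM": 0.8, "DI": 0.1, "DD": 0.1}

family_like = viterbi_log_odds("AGAC", E_MATCH, Q, A)
unrelated = viterbi_log_odds("TTTT", E_MATCH, Q, A)
```

Looping over i in the outer loop and j in the inner loop guarantees that every cell on the right-hand side is already filled: the delete recursion reads position j-1 at the same i, the insert recursion reads the same j at i-1. Keeping back-pointers alongside the maxima would recover the Viterbi alignment itself; that part is omitted here.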
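Forward algorithm
$$F^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \log\left[ a_{M_{j-1}M_j} \exp\left(F^M_{j-1}(i-1)\right) + a_{I_{j-1}M_j} \exp\left(F^I_{j-1}(i-1)\right) + a_{D_{j-1}M_j} \exp\left(F^D_{j-1}(i-1)\right) \right];$$
$$F^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \log\left[ a_{M_j I_j} \exp\left(F^M_j(i-1)\right) + a_{I_j I_j} \exp\left(F^I_j(i-1)\right) + a_{D_j I_j} \exp\left(F^D_j(i-1)\right) \right];$$
$$F^D_j(i) = \log\left[ a_{M_{j-1}D_j} \exp\left(F^M_{j-1}(i)\right) + a_{I_{j-1}D_j} \exp\left(F^I_{j-1}(i)\right) + a_{D_{j-1}D_j} \exp\left(F^D_{j-1}(i)\right) \right].$$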
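
Each forward cell replaces the max of the Viterbi recursion by the logarithm of a sum of exponentials. Evaluating that expression naively can underflow for long sequences, so a common device is the log-sum-exp trick sketched below. The helper names `log_sum_exp` and `forward_cell` are hypothetical, and the sketch covers a single cell rather than the full table.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):  # every predecessor is unreachable
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_cell(prev_scores, log_trans, emit_log_odds=0.0):
    """One cell of the recursion: emission log-odds + log sum_s a_s * exp(F_s).

    prev_scores holds the three predecessor values F^M, F^I, F^D and
    log_trans the logs of the matching transition probabilities.
    """
    return emit_log_odds + log_sum_exp(
        [f + la for f, la in zip(prev_scores, log_trans)])


# Hypothetical predecessor scores and transition probabilities.
prev = [-2.0, -5.0, float("-inf")]
trans = [math.log(0.8), math.log(0.1), math.log(0.1)]
f_cell = forward_cell(prev, trans, emit_log_odds=0.5)
v_cell = 0.5 + max(f + la for f, la in zip(prev, trans))  # Viterbi counterpart
```

Since a sum of non-negative terms is at least as large as its largest term, the forward value of a cell is never below the Viterbi value computed from the same inputs.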
Initialisation and termination
Viterbi algorithm:
– Initialisation:
$$V^M_0(0) = 0$$
– Termination:
$$V^M_{L+1}(n) = \max \begin{cases} V^M_L(n) + \log a_{M_L M_{L+1}}, \\ V^I_L(n) + \log a_{I_L M_{L+1}}, \\ V^D_L(n) + \log a_{D_L M_{L+1}}. \end{cases}$$
Forward algorithm:
– Initialisation:
$$F^M_0(0) = 0$$
– Termination:
$$F^M_{L+1}(n) = \log\left[ a_{M_L M_{L+1}} \exp\left(F^M_L(n)\right) + a_{I_L M_{L+1}} \exp\left(F^I_L(n)\right) + a_{D_L M_{L+1}} \exp\left(F^D_L(n)\right) \right]$$
Alternative to log-odds scoring
Log likelihood score (LL score).
Strongly length dependent.
Solutions:
– Divide by sequence length.
– Z-score.
Which method is preferred?
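
The two corrections can be sketched as follows. The slide does not define how the Z-score is computed, so the version here is an assumption: the LL score is compared with the mean and standard deviation of LL scores from unrelated sequences of similar length. The function names and all numbers are hypothetical.

```python
import statistics


def length_normalised(ll_score, seq_length):
    """LL score per residue: the simplest correction for length dependence."""
    return ll_score / seq_length


def z_score(ll_score, background_scores):
    """Standard deviations between ll_score and the mean of the background scores.

    background_scores are assumed to be LL scores of non-family sequences
    whose lengths are close to that of the query.
    """
    mu = statistics.mean(background_scores)
    sigma = statistics.stdev(background_scores)
    return (ll_score - mu) / sigma


# Hypothetical values, only to show the calls.
per_residue = length_normalised(-120.0, 60)
z = z_score(10.0, [1.0, 2.0, 3.0, 4.0, 5.0])
```

Dividing by length uses no reference population, whereas the Z-score needs a set of background scores for each length range but expresses the result on a scale that is comparable across sequences of different lengths.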
Flanking model states
Used to model the flanking sequences to the actual profile match itself.
Extra probabilities needed:
– Emission probability: q_a.
– ‘Looping’ transition probability: (1 - η).
– Transition probability from left flanking state: depends on application.
Model for local alignment (Smith-Waterman style)
[Figure: profile HMM (Begin, M_j, I_j, D_j, End) with flanking states Q between an outer Begin and End.]
Model for overlap matches
[Figure: profile HMM (Begin, M_j, I_j, D_j, End) with two flanking states Q.]
Model for repeat matches
[Figure: profile HMM (Begin, M_j, I_j, D_j, End) with a single flanking state Q between an outer Begin and End.]
Summary
Construction of a profile HMM for different kinds of alignments.
Use profile HMMs to detect potential membership in a family.
Use profile HMMs to give an alignment of a sequence to the family.