Profile HMMs for sequence families and Viterbi equations Linda Muselaars and Miranda Stobbe.

Example alignment

HBA_HUMAN   -HGSAQVKGHGKKVADALTNAVAHV-
HBB_HUMAN   VMGNPKVKAHGKKVLGAFSDGLAHL-
MYG_PHYCA   MKASEDLKKHGVTVLTALGAILKK--
GLB3_CHITP  IKGTAPFETHANRIVGFFSKIIGEL-
GLB5_PETMA  LKKSADVRWHAERIINAVNDAVASM-
LGB2_LUPLU  PQNNPELQAHAGKVFKLVYEAAIQLQ
GLB1_GLYDI  ---DPGVAALGAKVLAQIGVAVSHL-

Overview chapter 5

Ungapped score matrices.
Adding insert and delete states to obtain profile HMMs.
Deriving profile HMMs from multiple alignments.
Searching with profile HMMs.
Profile HMM variants for non-global alignments.
More on estimation of probabilities.
Optimal model construction.
Weighting training sequences.


Key-issues

Identifying the relationship of an individual sequence to a sequence family.
How to build a profile HMM.
Use profile HMMs to detect potential membership in a family.
Use profile HMMs to give an alignment of a sequence to the family.

Key-issues (2)

Lollypops for a valuable (up to the speakers to decide) contribution to this lecture.

Needed theory

Emission probabilities.
Silent states.
Pair HMMs.
The Viterbi algorithm.
The Forward algorithm.

Contents

Ungapped score matrices.
Adding insert and delete states to obtain profile HMMs.
Deriving profile HMMs from multiple alignments.
  – Non-probabilistic profiles
  – Basic profile HMM parameterisation
Searching with profile HMMs.
Profile HMM variants for non-global alignments.

Example alignment

HBA_HUMAN   -HGSAQVKGHGKKVADALTNAVAHV-
HBB_HUMAN   VMGNPKVKAHGKKVLGAFSDGLAHL-
MYG_PHYCA   MKASEDLKKHGVTVLTALGAILKK--
GLB3_CHITP  IKGTAPFETHANRIVGFFSKIIGEL-
GLB5_PETMA  LKKSADVRWHAERIINAVNDAVASM-
LGB2_LUPLU  PQNNPELQAHAGKVFKLVYEAAIQLQ
GLB1_GLYDI  ---DPGVAALGAKVLAQIGVAVSHL-
               *********************

Ungapped regions

Gaps tend to line up.
We can consider models for ungapped regions.
Specify independent probabilities $e_i(a)$:

$$P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$$

But of course: log-odds ratio!
Position specific score matrix.
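The log-odds scoring of an ungapped region can be sketched as follows. This is a minimal illustration, not the speakers' implementation: it takes the 21 ungapped columns (4 to 24) of the example alignment, and it assumes a uniform background distribution and a pseudocount of 1 per residue (both are choices made here so that no logarithm of zero occurs). The names `build_pssm` and `score` are hypothetical.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"

# Rows of the example alignment; columns 4..24 (0-based 3..23) contain no gaps.
rows = [
    "-HGSAQVKGHGKKVADALTNAVAHV-",
    "VMGNPKVKAHGKKVLGAFSDGLAHL-",
    "MKASEDLKKHGVTVLTALGAILKK--",
    "IKGTAPFETHANRIVGFFSKIIGEL-",
    "LKKSADVRWHAERIINAVNDAVASM-",
    "PQNNPELQAHAGKVFKLVYEAAIQLQ",
    "---DPGVAALGAKVLAQIGVAVSHL-",
]
block = [r[3:24] for r in rows]


def build_pssm(block, pseudocount=1.0):
    """Return one dict per column mapping residue a to log(e_i(a) / q_a)."""
    q = 1.0 / len(AMINO)  # assumed uniform background
    pssm = []
    for i in range(len(block[0])):
        column = [seq[i] for seq in block]
        total = len(column) + pseudocount * len(AMINO)
        pssm.append(
            {a: math.log(((column.count(a) + pseudocount) / total) / q) for a in AMINO}
        )
    return pssm


def score(pssm, x):
    """Sum of position-specific log-odds scores for an ungapped sequence x."""
    return sum(col[a] for col, a in zip(pssm, x))


pssm = build_pssm(block)
family_score = score(pssm, block[0])      # a sequence from the family
unrelated_score = score(pssm, "W" * 21)   # an arbitrary unrelated sequence
```

Summing per-position log-odds terms is the logarithm of the product formula for P(x | M) divided by the corresponding background probability, which is why the table of scores can be used directly as a position specific score matrix.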

Drawbacks

Multiple alignments do have gaps.
These need to be accounted for.
For example: BLOCKS database, with combined scores of ungapped regions.
We will develop a single probabilistic model for the whole extent of the alignment.


Short review

Emission probabilities:

the probability that a given symbol is seen when in state k.

Silent states:

states that do not emit symbols in an HMM.

Building the model (1)

We need position sensitive gap scores.
HMM with repetitive structure of (match) states.
Transitions of probability 1.
Emission probabilities: $e_{M_i}(a)$.

[Diagram: Begin → ... → M_j → ... → End]

Building the model (2)

Deal with insertions: set of new states $I_i$.
$I_i$ have emission distribution $e_{I_i}(a)$.
Set to the background distribution $q_a$.

[Diagram: Begin → M_j → End, with insert state I_j]

Building the model (3)

Deal with deletions.
Possibly forward jumps.
For arbitrarily long gaps: silent states $D_j$.

[Diagram: Begin → M_j → End, with delete state D_j]

Costs for additional states

States for insertions: the sum of the costs of the transitions and emissions (M → I, a number of I → I, I → M).

States for deletions: the sum of the costs of an M → D transition, a number of D → D transitions, and a D → M transition.
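As a worked instance of these costs, the log-odds contributions can be written out. This is a sketch under one stated assumption taken from the insert-state slide: insert emissions equal the background distribution $q_a$, so each inserted residue contributes $\log(q_a/q_a) = 0$ and only the transitions remain. The symbol $k$ (the length of the insertion or deletion) is introduced here for illustration.

```latex
% Insertion of k residues between M_j and M_{j+1}:
\log a_{M_j I_j} + (k-1)\,\log a_{I_j I_j} + \log a_{I_j M_{j+1}}

% Deletion of the k match states M_{j+1}, \dots, M_{j+k}:
\log a_{M_j D_{j+1}}
  + \sum_{m=1}^{k-1} \log a_{D_{j+m} D_{j+m+1}}
  + \log a_{D_{j+k} M_{j+k+1}}
```

The insertion cost depends on a single self-loop probability, whereas each D → D step in a deletion has its own position-specific transition probability.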

Full model

[Diagram: Begin → M_j → End, with insert state I_j and delete state D_j]

Comparison with pair HMM

[Diagram: pair HMM with Begin and End states, a match state M emitting with probability $p_{x_i y_j}$, and insert states X and Y emitting with probabilities $q_{x_i}$ and $q_{y_j}$]


Non-probabilistic profiles

Profile HMM without underlying probabilistic model.
Set scores to averages of standard substitution scores.
Anomalies:
  – Conservation of columns is not taken into account.
  – Scores for gaps do not behave properly.

Example

HBA_HUMAN   ...VGA--HAGEY...
HBB_HUMAN   ...V----NVDEV...
MYG_PHYCA   ...VEA--DVAGH...
GLB3_CHITP  ...VKG------D...
GLB5_PETMA  ...VYS--TYETS...
LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...
               ***  *****

The score for residue a in column 1 would be set to:

$$\frac{5}{7}\,s(\mathrm{V},a) + \frac{1}{7}\,s(\mathrm{F},a) + \frac{1}{7}\,s(\mathrm{I},a)$$
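The averaging for column 1 can be sketched in a few lines. The substitution function `toy_score` below is a hypothetical stand-in (+2 for identical residues, -1 otherwise), not a real substitution matrix such as BLOSUM; the function names are likewise introduced only for this illustration, and the column is assumed to contain no gaps.

```python
from fractions import Fraction


def toy_score(a, b):
    """Hypothetical substitution score: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


def average_profile_score(column, residue, s=toy_score):
    """Average of s(c, residue) over all residues c in an ungapped column."""
    return sum(Fraction(s(c, residue)) for c in column) / len(column)


column_1 = "VVVVVFI"  # column 1 of the example: five V, one F, one I
score_v = average_profile_score(column_1, "V")
score_f = average_profile_score(column_1, "F")
```

With this toy function the score for V is (5·2 + 2·(−1))/7 = 8/7. A column of seven identical V residues would give the same kind of average as a single V in a one-sequence alignment, which is one way to see why conservation is not taken into account.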

Basic profile HMM parameterisation

Objective: make the probability distribution peak around members of the family.
Available parameters:
  – Length of the model.
  – Transition and emission probabilities.

Length of the model

Which multiple alignment columns do we assign to match states?
And which to insert states?
Heuristic rule: columns in which more than 50% of the characters are gaps should be modelled by insert states.
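The heuristic rule can be sketched as a small function. The function name `match_columns` and its parameters are hypothetical; the toy alignment is the five-sequence DNA example used elsewhere in these slides.

```python
def match_columns(alignment, gap="-", max_gap_fraction=0.5):
    """Return 0-based indices of columns assigned to match states.

    A column becomes a match state unless more than max_gap_fraction of its
    characters are gaps; the remaining columns are modelled by insert states.
    """
    n_seqs = len(alignment)
    return [
        i
        for i, column in enumerate(zip(*alignment))
        if column.count(gap) / n_seqs <= max_gap_fraction
    ]


alignment = ["AG---C", "A-AG-C", "AG-AA-", "--AAAC", "AG---C"]
cols = match_columns(alignment)  # columns 1, 2, 4 and 6 in 1-based numbering
```

The number of match columns returned fixes the length of the model, here four match states.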

Probability parameters

Transition probability:

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$$

where $A_{kl}$ is the number of transitions from state k to state l, and the denominator is the number of transitions from state k to any other state.

Emission probability:

$$e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$$

In the limit of many training sequences this is an accurate and consistent estimate.

Pseudocount method: Laplace's rule (add one to each count).
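Written out, Laplace's rule adds one to every count before normalising. The formulas below are a sketch of that rule in the notation of this slide; the numerical line uses a DNA match column in which A is observed four times among four residues, as in the first column of the Bat/Rat/Cat/Gnat/Goat example.

```latex
a_{kl} = \frac{A_{kl} + 1}{\sum_{l'} \left( A_{kl'} + 1 \right)},
\qquad
e_k(a) = \frac{E_k(a) + 1}{\sum_{a'} \left( E_k(a') + 1 \right)}

% Worked example, alphabet {A, C, G, T}, counts E(A) = 4, E(C) = E(G) = E(T) = 0:
e(\mathrm{A}) = \frac{4 + 1}{4 + 4} = \frac{5}{8},
\qquad
e(\mathrm{C}) = e(\mathrm{G}) = e(\mathrm{T}) = \frac{0 + 1}{4 + 4} = \frac{1}{8}
```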

Example

Bat   A G - - - C
Rat   A - A G - C
Cat   A G - A A -
Gnat  - - A A A C
Goat  A G - - - C
      * *   *   *

Example continued

[Diagram: profile HMM with Begin, match states M1 to M4, insert states I0 to I4, delete states D1 to D4, and End]

Match state emission probabilities:

State  A    C    G    T
M1     5/8  1/8  1/8  1/8
M2     1/7  1/7  4/7  1/7
M3     3/7  1/7  2/7  1/7
M4     1/8  5/8  1/8  1/8

Transition probabilities from M1:

$a_{M_1 M_2} = 4/7$
$a_{M_1 D_2} = 2/7$
$a_{M_1 I_1} = 1/7$
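The numbers on this slide can be reproduced from the example alignment by counting and applying Laplace's rule. The sketch below assumes that the match columns are columns 1, 2, 4 and 6 (those with at most 50% gaps); the function names are hypothetical and only the transitions out of M1 are handled, to keep the example short.

```python
from fractions import Fraction

ALPHABET = "ACGT"
alignment = ["AG---C", "A-AG-C", "AG-AA-", "--AAAC", "AG---C"]
match_cols = [0, 1, 3, 5]  # 0-based indices of the columns for M1..M4


def match_emissions(alignment, col):
    """Laplace estimate of the emission distribution of one match column."""
    residues = [seq[col] for seq in alignment if seq[col] != "-"]
    total = len(residues) + len(ALPHABET)
    return {a: Fraction(residues.count(a) + 1, total) for a in ALPHABET}


def transitions_from_m1(alignment, match_cols):
    """Laplace estimates for M1->M2 ("M"), M1->I1 ("I") and M1->D2 ("D")."""
    c1, c2 = match_cols[0], match_cols[1]
    counts = {"M": 0, "I": 0, "D": 0}
    for seq in alignment:
        if seq[c1] == "-":
            continue  # this sequence passes through D1, not M1
        if seq[c1 + 1:c2].replace("-", ""):
            counts["I"] += 1  # residues between the two match columns
        elif seq[c2] == "-":
            counts["D"] += 1
        else:
            counts["M"] += 1
    total = sum(counts.values()) + len(counts)
    return {k: Fraction(v + 1, total) for k, v in counts.items()}


emissions = [match_emissions(alignment, c) for c in match_cols]
from_m1 = transitions_from_m1(alignment, match_cols)
```

Using `Fraction` keeps the results exact, so they can be compared directly with the fractions shown in the diagram.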


Searching with profile HMMs

Obtaining significant matches of a sequence to the profile HMM:
  – Viterbi algorithm: P(x, π* | M).
  – Forward algorithm: P(x | M).
Give an alignment of a sequence to the family.
  – Highest scoring, or Viterbi, alignment.

Viterbi equations

$V^{M}_{j}(i)$: log-odds score of the best path matching subsequence $x_{1 \ldots i}$ to the submodel up to state j, ending with $x_i$ being emitted by state $M_j$.

$V^{I}_{j}(i)$: log-odds score of the best path ending in $x_i$ being emitted by $I_j$.

$V^{D}_{j}(i)$: the best path ending in state $D_j$.

Pair HMM:

$$v^{M}(i,j) = p_{x_i y_j} \max \begin{cases} (1 - 2\delta - \tau)\, v^{M}(i-1, j-1), \\ (1 - \varepsilon - \tau)\, v^{X}(i-1, j-1), \\ (1 - \varepsilon - \tau)\, v^{Y}(i-1, j-1); \end{cases}$$

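Viterbi equations

$$V^{M}_{j}(i) = \log \frac{e_{M_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^{M}_{j-1}(i-1) + \log a_{M_{j-1} M_j}, \\ V^{I}_{j-1}(i-1) + \log a_{I_{j-1} M_j}, \\ V^{D}_{j-1}(i-1) + \log a_{D_{j-1} M_j}; \end{cases}$$

$$V^{I}_{j}(i) = \log \frac{e_{I_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^{M}_{j}(i-1) + \log a_{M_j I_j}, \\ V^{I}_{j}(i-1) + \log a_{I_j I_j}, \\ V^{D}_{j}(i-1) + \log a_{D_j I_j}; \end{cases}$$

$$V^{D}_{j}(i) = \max \begin{cases} V^{M}_{j-1}(i) + \log a_{M_{j-1} D_j}, \\ V^{I}_{j-1}(i) + \log a_{I_{j-1} D_j}, \\ V^{D}_{j-1}(i) + \log a_{D_{j-1} D_j}; \end{cases}$$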
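The three recurrences for the match, insert and delete states can be sketched in code. This is a simplified illustration, not the speakers' demo: it assumes position-independent transition probabilities (one value per pair of state types), insert emissions equal to the background (so their log-odds term is zero), a Begin state treated as M_0 and an End state treated as M_{L+1} without emission. The function name and the toy model are hypothetical, and no traceback is kept, so only the score of the best path is returned.

```python
import math

NEG_INF = float("-inf")


def viterbi_log_odds(x, match_log_odds, log_t):
    """Log-odds score of the best path of sequence x through a profile HMM.

    match_log_odds[j][a] = log(e_Mj(a) / q_a) for j = 1..L (index 0 unused).
    log_t[s, t] = log transition probability from state type s to type t.
    """
    L, n = len(match_log_odds) - 1, len(x)
    VM = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VI = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VD = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VM[0][0] = 0.0  # initialisation: Begin is M_0

    for j in range(L + 1):
        for i in range(n + 1):
            if j >= 1 and i >= 1:  # x_i emitted by M_j
                VM[j][i] = match_log_odds[j][x[i - 1]] + max(
                    VM[j - 1][i - 1] + log_t["M", "M"],
                    VI[j - 1][i - 1] + log_t["I", "M"],
                    VD[j - 1][i - 1] + log_t["D", "M"],
                )
            if i >= 1:  # x_i emitted by I_j (log-odds emission term is 0)
                VI[j][i] = max(
                    VM[j][i - 1] + log_t["M", "I"],
                    VI[j][i - 1] + log_t["I", "I"],
                    VD[j][i - 1] + log_t["D", "I"],
                )
            if j >= 1:  # silent state D_j, no symbol consumed
                VD[j][i] = max(
                    VM[j - 1][i] + log_t["M", "D"],
                    VI[j - 1][i] + log_t["I", "D"],
                    VD[j - 1][i] + log_t["D", "D"],
                )

    # Termination: End treated as M_{L+1}, without an emission term.
    return max(
        VM[L][n] + log_t["M", "M"],
        VI[L][n] + log_t["I", "M"],
        VD[L][n] + log_t["D", "M"],
    )


# Hypothetical toy model: L = 2, M1 favours A, M2 favours C, uniform background.
q = 0.25
match_log_odds = [None] + [
    {a: math.log((0.7 if a == fav else 0.1) / q) for a in "ACGT"} for fav in "AC"
]
log_t = {(s, t): math.log(0.8 if t == "M" else 0.1) for s in "MID" for t in "MID"}

best_ac = viterbi_log_odds("AC", match_log_odds, log_t)
best_agc = viterbi_log_odds("AGC", match_log_odds, log_t)
```

For "AC" the best path is Begin, M1, M2, End; for "AGC" it passes through I1 to absorb the extra G, which costs one additional M → I transition. Storing the argmax at each cell would allow the Viterbi alignment itself to be recovered.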

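Forward algorithm

$$F^{M}_{j}(i) = \log \frac{e_{M_j}(x_i)}{q_{x_i}} + \log \left[ a_{M_{j-1} M_j} \exp(F^{M}_{j-1}(i-1)) + a_{I_{j-1} M_j} \exp(F^{I}_{j-1}(i-1)) + a_{D_{j-1} M_j} \exp(F^{D}_{j-1}(i-1)) \right];$$

$$F^{I}_{j}(i) = \log \frac{e_{I_j}(x_i)}{q_{x_i}} + \log \left[ a_{M_j I_j} \exp(F^{M}_{j}(i-1)) + a_{I_j I_j} \exp(F^{I}_{j}(i-1)) + a_{D_j I_j} \exp(F^{D}_{j}(i-1)) \right];$$

$$F^{D}_{j}(i) = \log \left[ a_{M_{j-1} D_j} \exp(F^{M}_{j-1}(i)) + a_{I_{j-1} D_j} \exp(F^{I}_{j-1}(i)) + a_{D_{j-1} D_j} \exp(F^{D}_{j-1}(i)) \right];$$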
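Each forward recurrence contains a term of the form log[Σ a · exp(F)]. Evaluating it naively underflows once the F values become strongly negative, so implementations commonly factor out the largest exponent first. The helper below is a sketch of that standard trick; the function name is hypothetical and is not taken from the slides.

```python
import math


def log_sum_weighted_exp(terms):
    """Return log(sum(a * exp(F))) for (a, F) pairs, computed stably.

    Pairs with a == 0 or F == -inf contribute nothing and are skipped.
    """
    finite = [(a, f) for a, f in terms if a > 0 and f != float("-inf")]
    if not finite:
        return float("-inf")
    m = max(f for _, f in finite)  # factor out the largest exponent
    return m + math.log(sum(a * math.exp(f - m) for a, f in finite))


# exp(-1000) underflows to 0.0 in double precision, but the shifted sum is fine.
stable = log_sum_weighted_exp([(0.8, -1000.0), (0.1, -1001.0)])
```

The forward recurrences then have the same loop structure as the Viterbi ones, with each `max` replaced by a call to this helper on the corresponding (transition probability, F value) pairs.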

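Initialisation and termination

Viterbi algorithm:
  – Initialisation: $V^{M}_{0}(0) = 0$
  – Termination, for a sequence of length n, with the End state treated as $M_{L+1}$ and no emission term:

$$V^{M}_{L+1}(n+1) = \max \begin{cases} V^{M}_{L}(n) + \log a_{M_L M_{L+1}}, \\ V^{I}_{L}(n) + \log a_{I_L M_{L+1}}, \\ V^{D}_{L}(n) + \log a_{D_L M_{L+1}}; \end{cases}$$

Forward algorithm:
  – Initialisation: $F^{M}_{0}(0) = 0$
  – Termination:

$$F^{M}_{L+1}(n+1) = \log \left[ a_{M_L M_{L+1}} \exp(F^{M}_{L}(n)) + a_{I_L M_{L+1}} \exp(F^{I}_{L}(n)) + a_{D_L M_{L+1}} \exp(F^{D}_{L}(n)) \right]$$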

Alternative to log-odds scoring

Log likelihood score (LL score).
Strongly length dependent.
Solutions:
  – Divide by sequence length
  – Z-score
Which method is preferred?

Demo

[Demo slides: part of the profile HMM; scoring; part of the multiple alignment; relative frequencies]


Flanking model states

Used to model the flanking sequences to the actual profile match itself.
Extra probabilities needed:
  – Emission probability: $q_a$.
  – 'Looping' transition probability: (1 - η).
  – Transition probability from left flanking state: depends on application.
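The effect of the looping probability on the length of a flanking region can be made explicit. The derivation below is a sketch under an assumption not stated on the slide: the flanking state is entered once, emits one residue per visit, and is left with probability η. The symbol k (number of flanking residues) and the value η = 0.01 are introduced purely for illustration.

```latex
P(k) = (1 - \eta)^{k-1}\,\eta, \qquad k = 1, 2, \dots

\mathbb{E}[k] = \sum_{k \ge 1} k\,(1 - \eta)^{k-1}\,\eta = \frac{1}{\eta}

% Illustration: \eta = 0.01 gives an expected flanking length of 100 residues.
```

A looping probability close to 1 therefore corresponds to long flanking sequences, and because the flanking emissions equal the background $q_a$ they add nothing to the log-odds score.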

Model for local alignment (Smith-Waterman style)

[Diagram: profile HMM (Begin, M_j, I_j, D_j, End) placed between an outer Begin and End, with two flanking states Q]

Model for overlap matches

[Diagram: profile HMM (Begin, M_j, I_j, D_j, End) with two flanking states Q]

Model for repeat matches

[Diagram: profile HMM (Begin, M_j, I_j, D_j, End) placed between an outer Begin and End, with a single flanking state Q]

Summary

Construction of a profile HMM for different kinds of alignments.

Use profile HMMs to detect potential membership in a family.

Use profile HMMs to give an alignment of a sequence to the family.

Discussion subject

BLAST versus profile HMM