
Algorithms in Bioinformatics I, WS’06, ZBIT, C. Dieterich, February 6, 2007

12 Markov chains and Hidden Markov Models

Recommended reading:

• R. Durbin, S. Eddy, A. Krogh and G. Mitchison: “Biological Sequence Analysis”, Cambridge University Press, 1998

• W. Ewens and G. Grant: “Statistical Methods in Bioinformatics”

• L.R. Rabiner: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257-286, 1989.

• A. Krogh, M. Brown, I.S. Mian, K. Sjölander, D. Haussler: Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235:1501-1531, 1994.

12.1 CpG islands

Double stranded DNA:

...ApCpCpApTpGpApTpGpCpApGpGpApCpTpTpCpCpApTpCpGpTpTpCpGpCp...

...| | | | | | | | | | | | | | | | | | | | | | | | | | | | ...

...TpGpGpTpApCpTpApCpGpTpCpCpTpGpApApGpGpTpApGpCpApApGpCpGp...

Given 4 nucleotides, if the probability of occurrence were equal for each nucleotide, then p(a) ≈ 1/4 for each a ∈ {A, G, C, T}. Thus, the probability of occurrence of a dinucleotide would be ≈ 1/16. However, the frequencies of dinucleotides in DNA sequences vary widely. In particular, the dinucleotide CG is typically underrepresented, i.e. its frequency is typically < 1/16.

CG is the least frequent dinucleotide, because the C in a CpG pair is often modified by methylation (that is, an H-atom is replaced by a CH3-group) and the methyl-C has the tendency to mutate into T afterwards.

Upstream of a gene, however, the methylation process is suppressed in short regions of the genome of length 100-5000 bp. These areas are called CpG-islands and they are characterized by the fact that we see more CpG-pairs in them than elsewhere.

So, finding the CpG islands in a genome is an important problem.

CpG-islands are useful marks for genes in organisms whose genomes contain 5-methyl-cytosine.

CpG-islands in the promoter regions of genes play an important role in the deactivation of one copy of the X-chromosome in females, in imprinting and in the deactivation of intra-genomic parasites.

Classical definition: A CpG island is a DNA sequence of length about 200 bp with a C+G content of 50% and a ratio of observed-to-expected number of CpGs that is above 0.6. (Gardiner-Garden & Frommer, 1987)

According to a recent study [1], human chromosomes 21 and 22 contain about 1100 CpG-islands and about 750 genes.

12.1.1 Questions

1. Discrimination problem: Given a short segment of genomic sequence, how can we decide whether this segment comes from a CpG-island or not?

2. Localisation problem: Given a long segment of genomic sequence, how can we find all CpG-islands contained in it?

[1] Comprehensive analysis of CpG islands in human chromosomes 21 and 22, D. Takai & P. A. Jones, PNAS, March 19, 2002


12.2 Markov chains

Our goal is to set up a probabilistic model for CpG-islands. Because pairs of consecutive nucleotides are important in this context, we need a model in which the probability of a symbol depends on the preceding symbol. This leads us to Markov models and Markov chains.

Example:

[Figure: a Markov chain with four states A, C, G and T and arrows for all possible transitions between them.]

Circles = states, e.g. with names A, C, G and T. Arrows = possible transitions between states, each labeled with a transition probability.

At each time point t the Markov process occupies one of the states s_i in S. In each step from t to t + 1 the process either remains in the same state or moves to some other state in S, and it does this in a probabilistic way: assume that at time t the Markov random variable is in state s_i. We denote the probability that at time t + 1 it is in state s_j by p_ij = P(X_{t+1} = s_j | X_t = s_i); these are called the transition probabilities.

All transition probabilities are grouped in a transition probability matrix:

P = [p_ij] =
              to s_1   to s_2   ...   to s_k
    from s_1   p_11     p_12    ...    p_1k
    from s_2   p_21     p_22    ...    p_2k
    ...        ...      ...            ...
    from s_k   p_k1     p_k2    ...    p_kk

Any row in the matrix corresponds to the state from which the transition is made, and any column corresponds to the state to which the transition is made. Thus

∑_{j=1}^{k} p_ij = 1.

Definition 12.2.1 A (time-homogeneous) Markov model (of order 1) is a system (S, P) consisting of a finite set of states S = {s_1, s_2, . . . , s_k} and a transition matrix P = {p_st} with ∑_{t∈S} p_st = 1 for all s ∈ S, which determines the probability of the transition s → t as follows:

P(x_{i+1} = t | x_i = s) = p_st.

Definition 12.2.2 A Markov chain is a chain x_0, x_1, x_2, . . . , x_n, . . . of random variables, which take states in the state set S such that

P(x_n = s | x_{n−1}, . . . , x_0) = P(x_n = s | x_{n−1})

is true for all n > 0 and s ∈ S.

A Markov chain is called homogeneous if the transition probabilities do not depend on n. (At any time i the chain is in a specific state x_i, and at the tick of a clock the chain changes to state x_{i+1} according to the given transition probabilities. We call these changes instantaneous.)
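
As a small illustration (my own sketch, not part of the lecture), a homogeneous Markov chain can be stored as a transition matrix of nested dictionaries and sampled from directly; the values are those of the weather example in the next section, the function name is an assumption:

import random

# Transition matrix of a homogeneous Markov chain (weather example: R = rain, S = sun, C = clouds).
# Each row lists P(next state | current state) and sums to 1.
P = {
    "R": {"R": 0.5, "S": 0.1, "C": 0.4},
    "S": {"R": 0.2, "S": 0.6, "C": 0.2},
    "C": {"R": 0.3, "S": 0.3, "C": 0.4},
}

def sample_chain(P, start, n):
    """Sample a chain of n states, starting in state 'start'."""
    states = [start]
    for _ in range(n - 1):
        row = P[states[-1]]
        states.append(random.choices(list(row), weights=list(row.values()))[0])
    return "".join(states)

print(sample_chain(P, "S", 30).lower())   # e.g. 'ssssccrrrc...'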

12.2.1 Questions that a Markov chain can answer

Example: the weather in Tübingen, observed daily at midday. Possible states are rain, sun or clouds.


Transition probabilities:

      R    S    C
R    .5   .1   .4
S    .2   .6   .2
C    .3   .3   .4

An observation of this Markov chain would be a sequence of weather states, e.g.: ...rrrrrrccsssssscscscccrrcrcssss...

Types of questions that the model can answer:

If it is sunny today, what is the probability that the sun will shine for the next seven days?

How large is the probability that it will rain for a whole month?
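
As a small worked example for the first question (my own addition): staying sunny on each of the next seven days requires taking the S → S transition seven times in a row, so the probability is p_SS^7.

# Probability that the sun shines for the next seven days, given sun today:
p_SS = 0.6
print(p_SS ** 7)   # 0.6^7 is about 0.028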

12.2.2 Modeling the begin and end states

We have specified the transition matrix for a Markov chain; now we need to specify the initialization and termination of the chain. For the Markov chain we must give an initial probability P(x_1) of starting in a particular state.

We add a begin state to the model that is labeled ’Begin’ and add it to the state set. We will always assume that x_0 = Begin holds. Then the probability of the first state in the Markov chain is

P(x_1 = s) = p_{Begin,s} = P(s),

where P(s) denotes the background probability of symbol s.

Similarly, we explicitly model the end of the sequence using an end state ’End’. Thus, the probability that we end in state t is

P(End | x_L = t) = p_{t,End}.

12.2.3 Probability of Markov chains

Given a sequence of states x = x_1, x_2, x_3, . . . , x_L, what is the probability that a Markov chain will step through precisely this sequence of states?

P(x) = P(x_L, x_{L−1}, . . . , x_1)
     = P(x_L | x_{L−1}, . . . , x_1) P(x_{L−1} | x_{L−2}, . . . , x_1) · · · P(x_1)
       (by repeated application of P(X, Y) = P(X | Y) P(Y))
     = P(x_L | x_{L−1}) P(x_{L−1} | x_{L−2}) · · · P(x_2 | x_1) P(x_1)
     = P(x_1) ∏_{i=2}^{L} p_{x_{i−1} x_i}
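
A direct transcription of this formula into Python (my own sketch; it reuses the weather matrix from above and assumes a uniform initial distribution):

def chain_probability(x, P, P_init):
    """P(x) = P(x_1) * prod_{i=2..L} p_{x[i-1], x[i]} for a state sequence x."""
    prob = P_init[x[0]]
    for prev, cur in zip(x, x[1:]):
        prob *= P[prev][cur]
    return prob

P = {"R": {"R": 0.5, "S": 0.1, "C": 0.4},
     "S": {"R": 0.2, "S": 0.6, "C": 0.2},
     "C": {"R": 0.3, "S": 0.3, "C": 0.4}}
P_init = {"R": 1/3, "S": 1/3, "C": 1/3}   # assumed uniform start
print(chain_probability("SSSRRC", P, P_init))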

12.3 Extension of the model

Example:

[Figure: the four-state chain A, C, G, T extended by a begin state b and an end state e.]


# Markov chain that generates CpG islands
# (Source: DEKM98, p 50)
# Number of states:
6
# State labels:
A C G T * +
# Transition matrix:
0.1795 0.2735 0.4255 0.1195 0 0.002
0.1705 0.3665 0.2735 0.1875 0 0.002
0.1605 0.3385 0.3745 0.1245 0 0.002
0.0785 0.3545 0.3835 0.1815 0 0.002
0.2495 0.2495 0.2495 0.2495 0 0.002
0.0000 0.0000 0.0000 0.0000 0 1.000

12.4 Using Markov chains for the discrimination problem

We will use the equation P(x) = P(x_1) ∏_{i=2}^{L} p_{x_{i−1} x_i} to calculate the values of a likelihood ratio test.

In the case of the CpG-islands we will calculate the probability that a sequence comes from a CpG-island against the probability that it does not come from one.

First we need to determine the respective transition matrices. Transition matrices are generally calculated from training sets. In our case the transition matrix P+ for a DNA sequence that comes from a CpG-island is determined as follows:

p^+_st = c^+_st / ∑_{t′} c^+_{st′},

where c^+_st is the number of positions in a training set of CpG-islands at which state s is followed by state t.

Similarly the entries of the transition matrix P− are derived.
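
A sketch of this counting step in Python (my own illustration; training_islands is a hypothetical toy training set standing in for real CpG-island sequences):

from collections import defaultdict

def estimate_transition_matrix(training_seqs, alphabet="ACGT"):
    """ML estimate p_st = c_st / sum_t' c_st' from observed dinucleotide counts."""
    counts = {s: defaultdict(float) for s in alphabet}
    for seq in training_seqs:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    matrix = {}
    for s in alphabet:
        total = sum(counts[s][t] for t in alphabet)
        matrix[s] = {t: counts[s][t] / total for t in alphabet} if total else None
    return matrix

training_islands = ["CGCGGCGCATCGCG", "GCGCGCCGGAGCGC"]   # toy data
P_plus = estimate_transition_matrix(training_islands)
print(P_plus["C"]["G"])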

12.4.1 Two examples of Markov chains

# Markov chain for CpG islands        # Markov chain for non-CpG islands
# Number of states:                   # Number of states:
4                                     4
# State labels:                       # State labels:
A C G T                               A C G T
# Transition matrix P+:               # Transition matrix P-:
.1795 .2735 .4255 .1195               .2995 .2045 .2845 .2095
.1705 .3665 .2735 .1875               .3215 .2975 .0775 .3015
.1605 .3385 .3745 .1245               .2475 .2455 .2975 .2075
.0785 .3545 .3835 .1815               .1765 .2385 .2915 .2915

12.4.2 Answering question 1

Given a short sequence x = (x1, x2, . . . , xL). Does it come from a CpG-island (model+)?

P(x | model+) = P(x_1) ∏_{i=2}^{L} p^+_{x_{i−1} x_i},

Or does it not come from a CpG-island (model−)?

P(x | model−) = P(x_1) ∏_{i=2}^{L} p^−_{x_{i−1} x_i}.


We calculate the log-odds ratio

S(x) = log [ P(x | model+) / P(x | model−) ] = ∑_{i=2}^{L} log ( p^+_{x_{i−1} x_i} / p^−_{x_{i−1} x_i} ) = ∑_{i=2}^{L} β_{x_{i−1} x_i}

with β_{x_{i−1} x_i} being the log likelihood ratios of the corresponding transition probabilities. For the transition matrices above we calculate, for example, β_AA = log(0.18/0.30). Often the base-2 logarithm is used, in which case the unit is bits.

If model+ and model− differ substantially, then a typical CpG-island should have a higher probability under model+ than under model−, and the log-odds ratio becomes positive.

Generally we could use a threshold value c* and a test function to determine whether a sequence x comes from a CpG-island:

φ*(x) := { 1 if S(x) > c*
         { 0 if S(x) ≤ c*

where φ*(x) = 1 indicates that x comes from a CpG-island.

Such a test is called a Neyman-Pearson test.
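
A sketch of the resulting discrimination test in Python (my own illustration), using the two transition matrices from the tables above and a hypothetical threshold c_star:

import math

plus_p = {
    "A": {"A": .1795, "C": .2735, "G": .4255, "T": .1195},
    "C": {"A": .1705, "C": .3665, "G": .2735, "T": .1875},
    "G": {"A": .1605, "C": .3385, "G": .3745, "T": .1245},
    "T": {"A": .0785, "C": .3545, "G": .3835, "T": .1815},
}
minus_p = {
    "A": {"A": .2995, "C": .2045, "G": .2845, "T": .2095},
    "C": {"A": .3215, "C": .2975, "G": .0775, "T": .3015},
    "G": {"A": .2475, "C": .2455, "G": .2975, "T": .2075},
    "T": {"A": .1765, "C": .2385, "G": .2915, "T": .2915},
}

def log_odds(x):
    """S(x) = sum_i log2( p+_{x[i-1]x[i]} / p-_{x[i-1]x[i]} ), in bits."""
    return sum(math.log2(plus_p[s][t] / minus_p[s][t]) for s, t in zip(x, x[1:]))

def phi(x, c_star=0.0):
    """Neyman-Pearson style test: 1 = 'CpG-island', 0 = 'not a CpG-island'."""
    return 1 if log_odds(x) > c_star else 0

print(log_odds("CGCGCGCG"), phi("CGCGCGCG"))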

12.5 Hidden Markov Models (HMM)

Motivation: the localisation problem, i.e. how to detect CpG-islands inside a long sequence x = x[1, L), L ≫ 0?

One idea is to use the Markov chain models derived above and apply them by calculating the log-odds ratio for a window of fixed width w ≪ L that is moved along the sequence; the score S(x[k, k + w)) is plotted for each starting position k (1 ≤ k ≤ L − w). Problems: it is hard to determine the boundaries of CpG-islands, and which window size w should one choose? . . .

Approach: Merge the two Markov chains model+ and model− to obtain a so-called Hidden Markov Model.

A hidden Markov model is similar to a discrete-time Markov chain, but more general and thus more flexible. The main difference is that when states are visited, they “emit” letters from a fixed time-independent alphabet. In our case of the localisation of CpG-islands, we want to have both Markov chains within one model, with a small probability of switching from one chain to the other at each transition point. But here we face a problem: for each letter from our alphabet (i.e. each nucleotide) we have two states, depending on which model they come from. We resolve this by relabeling the states such that we have the states A+, C+, G+ and T+, which emit A, C, G and T within CpG-regions, and correspondingly the states A−, C−, G− and T−, which emit A, C, G and T within non-CpG-regions.

Example:

[Figure: the eight states A+, C+, G+, T+ (upper row) and A−, C−, G−, T− (lower row), with transitions within each row and between the two rows.]

(Additionally, we have all transitions between states in either of the two sets that carry over from the two Markov chains model+ and model−.) Hidden Markov models are a machine learning approach. The algorithms are presented with training data, from which parameters are estimated. Then the algorithm is applied to test data.


The better the training data, the better the accuracy of the HMM algorithm. The HMM uses the estimated/learned parameters in the framework of dynamic programming to find the best explanation for the experimental data.

12.5.1 Hidden Markov Models in bioinformatics

Roots of HMM theory can be traced back to the 1950s. However, to become applicable to practical problems it needed Andrew Viterbi, Leonard Baum and colleagues to develop algorithms for decoding and parameter estimation in the late 1970s.

The first scientists to apply HMMs in bioinformatics were Pierre Baldi, Gary Churchill and David Haussler, at the beginning of the 1990s.

A review type article that is highly recommended for reading is:

A. Krogh, M. Brown, I.S. Mian, K. Sjölander, D. Haussler: Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235:1501-1531, 1994.

We now want to formally define a hidden Markov model. We need to distinguish the sequence of states from the sequence of symbols. We call the sequence of states the path and denote it by π. The transitions within the path follow a Markov chain, with transition probabilities depending only on the previous state. The path sequence is hidden.

We observe a sequence x of symbols, where the symbols are emitted from the states with certain emission probabilities. Since we decoupled the alphabet symbols from the states, we need these parameters to model how a state produces a symbol from a distribution over all possible symbols. Thus

ek(b) = P(xi = b|πi = k)

where π = π_1 π_2 . . . π_L is the (hidden) state sequence, x = x_1 . . . x_L the observed symbol sequence, and e_k(b) is the probability that symbol b is seen when in state k.

Definition 12.5.1 A hidden Markov model (HMM) is a system M = (Σ, Q, P, e) consisting of

• an alphabet Σ, often also called the emission alphabet,

• a set of states Q,

• a matrix P = {pkl} of transition probabilities pkl for k, l ∈ Q, and

• an emission probability ek(b) for every k ∈ Q and b ∈ Σ.

12.5.2 HMM for CpG-islands

In our example we get Σ = {A, G, C, T} for the alphabet, and Q = {A+, C+, G+, T+, A−, C−, G−, T−}, plus possibly start and end states, for the state set. The entries of the transition matrix P should be close to the transition probabilities of the original two-component model, but with small chances of switching from the + to the − states and vice versa. Generally, the chance of switching from + to − is larger than vice versa.

In the case of the CpG-islands the emission probabilities are all 0 or 1, as can be seen below.

# Number of states:
9
# Names of states (begin/end, A+, C+, G+, T+, A-, C-, G- and T-):
0 A C G T a c g t
# Number of symbols:
4
# Names of symbols:
a c g t


# Transition matrix, probability to change from + to - (and vice versa) is 10E-4
0.0000 0.0725 0.1638 0.1788 0.0755 0.1322 0.1267 0.1226 0.1279
0.0010 0.1762 0.2683 0.4171 0.1175 0.0036 0.0055 0.0085 0.0024
0.0010 0.1672 0.3599 0.2680 0.1839 0.0034 0.0073 0.0055 0.0038
0.0010 0.1576 0.3319 0.3671 0.1224 0.0032 0.0068 0.0075 0.0025
0.0010 0.0773 0.3476 0.3759 0.1782 0.0016 0.0071 0.0077 0.0036
0.0010 0.0003 0.0002 0.0003 0.0002 0.2994 0.2046 0.2844 0.2096
0.0010 0.0003 0.0003 0.0001 0.0003 0.3214 0.2974 0.0778 0.3014
0.0010 0.0002 0.0002 0.0003 0.0002 0.2476 0.2455 0.2974 0.2076
0.0010 0.0002 0.0002 0.0003 0.0003 0.1766 0.2385 0.2914 0.2914
# Emission probabilities:
0 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1

From now on we use 0 for the begin and end state.

12.6 Examples for HMMs

12.6.1 fair/loaded dice

Let us return to our example of the casino to illustrate the emission probabilities. An occasionally dishonest casino uses two kinds of dice, a fair one and a loaded one. With certain probabilities the casino dealer changes from one type of die to the other. In the case of the loaded die, a 6 appears 50% of the time.

Why is this now an example of an HMM?

The casino guest only observes the number rolled:

6 4 3 2 3 4 6 5 1 2 3 4 5 6 6 6 3 2 1 2 6 3 4 2 1 6 6...

Which die was used remains hidden:

F F F F F F F F F F F F L L L L L F F F F F F F F F F...

200 Algorithms in Bioinformatics I, WS’06, ZBIT, C. Dieterich, February 6, 2007

So the states are the two dice, while the emission alphabet consists of the numbers 1, 2, 3, 4, 5, 6. The transition probabilities are the probabilities of switching between the two dice, while the emission probabilities are the probabilities of the numbers appearing, depending on the chosen die. The graph shows the probabilities:

[Figure: two states, Fair and Unfair (loaded).
 Emission probabilities: Fair: 1-6 with 1/6 each; Unfair: 1-5 with 1/10 each, 6 with 1/2.
 Transition probabilities: Fair → Unfair 0.05, Unfair → Fair 0.1, Fair → Fair 0.95, Unfair → Unfair 0.9.]

Here the emission probabilities are those listed for each state.

12.6.2 Urns model

Given p urns U_1, U_2, . . . , U_p. Each urn U_i contains r_i red, g_i green and b_i blue balls. An urn U_i is randomly selected and from it a random ball k is taken (with replacement). The color of the ball k is reported.

[Figure: urns U_1, U_2, . . . , U_p; urn U_i contains r_i red, g_i green and b_i blue balls.]

Thus the (emission) alphabet is the set of ball colors Σ = {r, g, b} and the (hidden) state set is Q = {U_1, U_2, . . . , U_p}. The HMM process generates a sequence x such as, e.g.,

r r g g b b g b g g g b b b r g r g b b b g g b g g b...

Again, (a part of) the actual state (namely which urn was chosen) is hidden.

12.6.3 HMM for the urns model

# Four urns
# Number of states:
5
# Names of states:
# (0 begin/end, and urns A-D)
0 A B C D
# Number of symbols:
3
# red, green, blue
r g b
# Transition matrix:
0    .25  .25  .25  .25
0.01 .69  .30  0    0
0.01 0    .69  .30  0
0.01 0    0    .69  .30
0.01 .30  0    0    .69
# Emission probabilities:
0  0  0
.8 .1 .1
.2 .5 .3
.1 .1 .8
.3 .3 .4
# EOF


12.6.4 Generation of data under an HMM

Generally, a sequence can be generated from an HMM as follows: first a state π_1 is chosen according to the probabilities p_{0i}. In that state a symbol is emitted according to the emission probability e_{π_1} for that state. We then move to the next state π_2 according to the transition probabilities p_{π_1 i}, and so on. We report the sequence of emitted symbols.

Thus the general algorithm for generating a sequence of random observations under an HMM is

Algorithm

Start in state b (0).

While we have not entered state e (0):

Choose a new state using the transition probabilities

Choose a symbol using the emission probabilities and report it.
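
A Python sketch of this generation procedure for the fair/loaded casino HMM (my own illustration; the small probability 0.01 of entering the end state 0, the correspondingly reduced self-transitions and the 50/50 start are assumptions added so that the loop terminates, since the figure above does not specify begin/end transitions):

import random

transitions = {
    "0": {"F": 0.5, "L": 0.5, "0": 0.0},     # begin state: assumed 50/50 start
    "F": {"F": 0.94, "L": 0.05, "0": 0.01},
    "L": {"F": 0.10, "L": 0.89, "0": 0.01},
}
emissions = {
    "F": {d: 1 / 6 for d in "123456"},
    "L": {"1": .1, "2": .1, "3": .1, "4": .1, "5": .1, "6": .5},
}

def generate(transitions, emissions):
    """Walk from the begin state 0 until state 0 is entered again; report symbols and states."""
    state, symbols, states = "0", [], []
    while True:
        row = transitions[state]
        state = random.choices(list(row), weights=list(row.values()))[0]
        if state == "0":
            return "".join(symbols), "".join(states)
        e = emissions[state]
        symbols.append(random.choices(list(e), weights=list(e.values()))[0])
        states.append(state)

x, pi = generate(transitions, emissions)
print(x)
print(pi)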

12.6.5 A sequence generated for the casino example

We use the fair/loaded HMM to generate a sequence of states and symbols:

x : 24335642611341666666526562426612134635535566462666636664253
p : FFFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLFFFFFFFFFFLLLLLLLLLLLLLFFFF

x : 35246363252521655615445653663666511145445656621261532516435
p : FFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLFFLLLLLLLLLLLLLLFFFFFFFFF

x : 5146526666
p : FFLLLLLLLL

Typical questions that arise are:

1. How probable is an observed sequence of data: P(x)?

2. If we can observe only the symbols, can we reconstruct the corresponding states? What is the most likely die sequence?

3. What is the probability that π_i = k, given the observed sequence x: P(π_i = k | x)?

12.7 Three Algorithms for HMMs

Let M be an HMM and x = x[0, L) a sequence of symbols. For the first two of the following questions we will assume that the emission and transition probabilities, as well as the start and end probabilities, have been predetermined.

(Q1) For x, determine the most probable sequence of states through M : Viterbi-algorithm

(Q2) Determine the probability that M generated x, P(x) = P(x | M): forward-algorithm

(Q3) Given x and perhaps some additional sequences of symbols, how do we train the parameters of M? E.g., Baum-Welch-algorithm


12.8 Determining the probability, given the states and symbols

Definition 12.8.1 A path π = (π1, π2, . . . , πL) is a sequence of states in the model M .

Example: Assume we observe the sequence of symbols C G C G. Using our CpG-model, there are a number of state sequences that could have emitted this sequence, e.g.

(C+, G+, C+, G+), (C−, G−, C−, G−) and (C−, G+, C−, G+).

                        C      G      C      G
state π_i:              C+     G+     C+     G+
P(x_i | π_i):           1      1      1      1
P(π_{i−1} → π_i):       0.16   0.27   0.33   0.27

The probability for the sequence and the first state sequence π would be

P(x, π) = p_{0,C+} × (1 · p_{C+,G+} × 1 · p_{G+,C+} × 1 · p_{C+,G+} × 1 · p_{G+,0})

For our model parameters above we thus get

P(C G C G, C+, G+, C+, G+) = 0.164 × (1 · 0.268 × 1 · 0.3319 × 1 · 0.268 × 1 · 0.001) ≈ 3.9 · 10^−6
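
The same product, evaluated in Python (values taken from the HMM transition matrix above; the emission factors are all 1):

p = 0.1638 * (1 * 0.2680) * (1 * 0.3319) * (1 * 0.2680) * (1 * 0.0010)
print(p)   # approximately 3.9e-06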

12.8.1 “Decoding” a sequence of symbols

Given a sequence of symbols x = (x_1, . . . , x_L) and a path π = (π_1, . . . , π_L) through M, the joint probability is:

P(x, π) = p_{0 π_1} ∏_{i=1}^{L} e_{π_i}(x_i) p_{π_i π_{i+1}},    (12.1)

with π_{L+1} = 0.

Equation (12.1) is the obvious analogue of the Markov chain probability (see section 12.2.3).

Unfortunately, we usually do not know the path through the model.

Problem: We have observed a sequence x of symbols and would like to “decode” the sequence.

A path through the HMM determines which parts of the sequence x are classified as CpG-islands; such a classification of the observed symbols is called a decoding.

12.8.2 The most probable path

To solve the decoding problem, we want to determine the path π* that maximizes the probability of having generated the sequence x of symbols, that is:

π* = argmax_π P(x, π)

This most probable path π∗ can be computed via a dynamic programming algorithm.

The idea is to represent the search for the optimal path by a weighted grid-type graph, shown below for two states. The vertices represent the states and each column corresponds to a position of the sequence.


The edge weights are set such that their product along a path π equals P(x, π). The weight of an edge from (k, i) to (l, i + 1) is given by e_l(x_{i+1}) · p_kl.

Why does that hold?

Let us compare the probability p_{k,i} of a path π ending in (k, i) with p_{l,i+1}:

p_{l,i+1} = ∏_{j=1}^{i+1} e_{π_j}(x_j) p_{π_{j−1} π_j}
          = ( ∏_{j=1}^{i} e_{π_j}(x_j) p_{π_{j−1} π_j} ) · e_{π_{i+1}}(x_{i+1}) p_{π_i π_{i+1}}
          = p_{k,i} · e_l(x_{i+1}) p_kl

12.8.3 Viterbi variables

So the idea is that the optimal path for the prefix x_1, . . . , x_{i+1} uses a path for x_1, . . . , x_i that is optimal among all paths ending in some (unknown) state π_i.

Given a prefix (x_1, x_2, . . . , x_i), let v_k(i) denote the probability that the most probable path is in state k when it generates symbol x_i at position i. Then for any state l ∈ Q:

v_l(i + 1) = max_{k∈Q} { v_k(i) · weight of the edge between (k, i) and (l, i + 1) }
           = max_{k∈Q} { v_k(i) · p_kl · e_l(x_{i+1}) }
           = e_l(x_{i+1}) max_{k∈Q} { v_k(i) p_kl },    (12.2)

with v_b(0) = 1 and v_k(0) = 0 for k ≠ begin state, initially.

Definition 12.8.2 The vk(i) are called Viterbi variables.

Again we turn to the CpG-islands. For a sequence x = x_0 x_1 x_2 x_3 . . . x_{i−2} x_{i−1} x_i . . . we need to find the values v_l(i) for each of the 8 + 1 states l in Q. We keep a pointer to the entry that the most probable path came from:

[Grid: one column per position x_0, x_1, . . . , x_{i+1}; column x_0 contains only the begin state 0 (with v = 1), every later column contains the eight states A+, C+, G+, T+, A−, C−, G−, T−; back-pointers mark the entries on the most probable path.]

12.8.4 The Viterbi-algorithm

Input: HMM M = (Σ, Q, P, e) and symbol sequence x
Output: most probable path π*.

Initialization: (i = 0): v0(0) = 1, vk(0) = 0 for k > 0.


For all i = 1, . . . , L and all l ∈ Q:
    v_l(i) = e_l(x_i) max_{k∈Q} (v_k(i−1) p_kl)
    ptr_i(l) = argmax_{k∈Q} (v_k(i−1) p_kl)

Termination:
    P(x, π*) = max_{k∈Q} (v_k(L) p_k0)
    The most probable path ends with π*_L = argmax_{k∈Q} (v_k(L) p_k0).

Traceback:
    For all i = L, . . . , 1:
        π*_{i−1} = ptr_i(π*_i)

Implementation hint: instead of multiplying many small values, add their logarithms!

(Exercise: determine the run-time complexity.)
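
Following the implementation hint, here is a compact Python sketch of the Viterbi algorithm in log space (my own illustration, using the fair/loaded casino HMM; the uniform start distribution and the omission of an explicit end transition are simplifying assumptions):

import math

def viterbi(x, states, start_p, trans_p, emit_p):
    """Return (log P(x, pi*), pi*) for the most probable state path pi*."""
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    v = [{k: log(start_p[k]) + log(emit_p[k][x[0]]) for k in states}]
    ptr = []
    for i in range(1, len(x)):
        v.append({})
        ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: v[i - 1][k] + log(trans_p[k][l]))
            ptr[-1][l] = best_k
            v[i][l] = v[i - 1][best_k] + log(trans_p[best_k][l]) + log(emit_p[l][x[i]])
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for back in reversed(ptr):          # traceback
        path.append(back[path[-1]])
    path.reverse()
    return v[-1][last], "".join(path)

states = ["F", "L"]
start_p = {"F": 0.5, "L": 0.5}
trans_p = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
emit_p = {"F": {d: 1 / 6 for d in "123456"},
          "L": {"1": .1, "2": .1, "3": .1, "4": .1, "5": .1, "6": .5}}
score, path = viterbi("266666212", states, start_p, trans_p, emit_p)
print(score, path)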

12.8.5 Example for Viterbi

Given the sequence C G C G and the HMM for CpG-islands, here is a table of possible values for v:

 state    v(0)   C          G          C          G
 0        1      0          0          0          0
 A+       0      0          0          0          0
 C+       0      .001       0          .0013      0
 G+       0      0          .0036      0          3.5 · 10−4
 T+       0      0          0          0          0
 A−       0      0          0          0          0
 C−       0      .001       0          2.4 · 10−5 0
 G−       0      0          3 · 10−7   0          9.5 · 10−6
 T−       0      0          0          0          0

12.8.6 Viterbi-decoding of the casino example

We used the fair/loaded HMM to first generate a sequence of symbols and then applied the Viterbi algorithm to decode the sequence. The result:

Symbols: 24335642611341666666526562426612134635535566462666636664253
States : FFFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLFFFFFFFFFFLLLLLLLLLLLLLFFFF
Viterbi: FFFFFFFFFFFFFFLLLLLLLLLLLLLLLLFFFFFFFFFFFFLLLLLLLLLLLLLFFFF

Symbols: 35246363252521655615445653663666511145445656621261532516435
States : FFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLFFLLLLLLLLLLLLLLFFFFFFFFF
Viterbi: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Symbols: 5146526666
States : FFLLLLLLLL
Viterbi: FFFFFFLLLL

12.9 Forward-algorithm: Computing P(x | M)

Now we turn to the second question: Assume we are given an HMM M and a sequence of symbols x. For Markov chains we calculated the probability of a chain, P(x), in order to distinguish whether x came, say, from a CpG-island or not.


For the probability that x was generated by an HMM M we have to sum over all possible state sequences π:

P(x | M) = ∑_π P(x, π | M).

A first idea would be to calculate this by using the equation

P(x | M) = ∑_π P(x | π, M) P(π | M).

However, the number of possible paths grows exponentially with the length of the sequence, so summing over all of them explicitly is not feasible. Instead, P(x | M) is computed by a dynamic programming algorithm, the forward algorithm, which is obtained from the Viterbi algorithm by replacing the maximization by a sum. More precisely, we define the forward variable:

fk(i) = P(x1 . . . xi, πi = k),

which equals the probability that the model M emits the prefix sequence (x_1, . . . , x_i) and is in state k at position i.

The equivalent recursion is then:

f_l(i + 1) = e_l(x_{i+1}) ∑_{k∈Q} f_k(i) p_kl    (12.3)

[Figure: the forward recursion; f_l(i+1) collects the contributions f_k(i) p_kl from all states k at position i.]

Input: HMM M = (Σ, Q, P, e) and sequence of symbols x
Output: probability P(x | M)

Initialization: (i = 0): f0(0) = 1, fk(0) = 0 for k > 0.

For all i = 1, . . . , L and all l ∈ Q:
    f_l(i) = e_l(x_i) ∑_{k∈Q} f_k(i−1) p_kl

Termination: P(x | M) = ∑_{k∈Q} f_k(L) p_k0

Note: again we face the problem of having to multiply many very small numbers. However, here a log transformation is not so straightforward (because of the sums).

Lemma: The complexity of the forward algorithm is O(|Q|² · L).
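
A Python sketch of the forward algorithm for the casino HMM (my own illustration; probabilities are kept unscaled, which is acceptable for short sequences, and the uniform start distribution and missing end transition are simplifying assumptions):

def forward(x, states, start_p, trans_p, emit_p):
    """Return P(x | M), summing over all paths via recursion (12.3)."""
    f = [{k: start_p[k] * emit_p[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({l: emit_p[l][x[i]] * sum(f[i - 1][k] * trans_p[k][l] for k in states)
                  for l in states})
    return sum(f[-1].values())   # simplified termination without p_k0

states = ["F", "L"]
start_p = {"F": 0.5, "L": 0.5}
trans_p = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
emit_p = {"F": {d: 1 / 6 for d in "123456"},
          "L": {"1": .1, "2": .1, "3": .1, "4": .1, "5": .1, "6": .5}}
print(forward("266666212", states, start_p, trans_p, emit_p))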


12.10 Backward-algorithm

Using the Viterbi algorithm we can search for the most probable path, i.e. the state sequence that has generated an observed sequence. But this does not help us with predictions about future states.

Definition 12.10.1 The backward variable is defined as the probability of generating (emitting) the suffix sequence (x_{i+1}, . . . , x_L), given that the chain is in state k at position i:

b_k(i) = P(x_{i+1} . . . x_L | π_i = k)    (12.4)

The backward algorithm uses a very similar recursion formula:

b_k(i) = ∑_{l∈Q} e_l(x_{i+1}) b_l(i + 1) p_kl    (12.5)

[Figure: the backward recursion; b_k(i) collects the contributions p_kl e_l(x_{i+1}) b_l(i+1) over all states l at position i + 1.]

Input: HMM M = (Σ, Q, P, e) and sequence of symbols x
Output: probability P(x | M)

Initialization: (i = L): bk(L) = pk0 for all k.

For all i = L−1, . . . , 1 and all k ∈ Q:
    b_k(i) = ∑_{l∈Q} e_l(x_{i+1}) b_l(i + 1) p_kl

Termination: P(x | M) = ∑_{l∈Q} p_0l e_l(x_1) b_l(1)
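
The corresponding Python sketch of the backward algorithm (same simplified casino HMM as in the forward sketch; with no explicit end state, b_k(L) is initialised to 1 instead of p_k0):

def backward(x, states, start_p, trans_p, emit_p):
    """Return P(x | M) via the backward recursion (12.5)."""
    L = len(x)
    b = [{k: 1.0 for k in states} for _ in range(L)]   # simplified: b_k(L) = 1
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(emit_p[l][x[i + 1]] * b[i + 1][l] * trans_p[k][l] for l in states)
                for k in states}
    return sum(start_p[l] * emit_p[l][x[0]] * b[0][l] for l in states)

states = ["F", "L"]
start_p = {"F": 0.5, "L": 0.5}
trans_p = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
emit_p = {"F": {d: 1 / 6 for d in "123456"},
          "L": {"1": .1, "2": .1, "3": .1, "4": .1, "5": .1, "6": .5}}
print(backward("266666212", states, start_p, trans_p, emit_p))   # same value as forward()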

12.11 Comparison of the three variables

Viterbi   v_k(i)   probability with which the most probable state path generates the sequence of symbols (x_1, x_2, . . . , x_i) and the system is in state k at time i.

Forward   f_k(i)   probability that the prefix sequence of symbols x_1, . . . , x_i is generated and the system is in state k at time i.

Backward  b_k(i)   probability that the system, starting in state k at time i, then generates the sequence of symbols x_{i+1}, . . . , x_L.

12.12 Posterior probabilities

Assume an HMM M and a sequence of symbols x are given.


Let P(π_i = k | x) denote the probability that the HMM is in state k at position i, given the observed sequence x. We call this the posterior probability, as it is computed after observing the sequence x.

We first calculate the probability of producing the entire sequence x with the i-th symbol being produced by state k:

P(π_i = k, x) = P(π_i = k, x_1 . . . x_i) P(x_{i+1} . . . x_L | π_i = k, x_1 . . . x_i)
              = P(π_i = k, x_1 . . . x_i) P(x_{i+1} . . . x_L | π_i = k)    (12.6)
              = f_k(i) b_k(i)    (12.7)

Thus we have:

P(π_i = k | x) = P(π_i = k, x) / P(x) = f_k(i) b_k(i) / P(x),    (12.8)

as P(X, Y ) = P(X | Y )P(Y ) and by definition of the forward- and backward-variable.

Here a typical question could be: what is the probability that the dealer in the casino used a loaded die at moment i?

12.12.1 Decoding with posterior probabilities

The posterior probability is also used as an alternative to the Viterbi decoding. It is useful, e.g., when many other paths exist that have a probability similar to that of π*.

We define a sequence of states π̂ by:

π̂_i = argmax_{k∈Q} P(π_i = k | x),

in other words, at every position we choose the most probable state for that position.

This decoding may be useful if we are interested in the state at a specific position i and not in the whole sequence of states.

Warning: if the transition matrix forbids some transitions (i.e., p_kl = 0), then this decoding may produce a sequence that is not a valid path, because its probability is 0!

12.13 Estimating parameters for HMMs

The preceding analyses assumed that we know the parameters of the HMM, such as the transition and emission probabilities. Now we turn to the question: How does one generate an HMM?

First step: Determine its “topology”, i.e. the number of states and how they are connected via transitions of non-zero probability.

Second step: Estimate the parameters, i.e. the transition probabilities p_kl and the emission probabilities e_k(b).

We consider the second step. Given a set of example sequences, our goal is to “train” the parameters of the HMM using these example (training) sequences, i.e. to set the parameters in such a way that the probability with which the HMM generates the given example sequences is maximized.

12.13.1 Log-likelihood

Given sequences of symbols x^1, x^2, . . . , x^n. We assume that the sequences of symbols are independent and therefore P(x^1, . . . , x^n) = P(x^1) · . . . · P(x^n) holds.


Let M = (Σ, Q, P, e) be an HMM. In order to construct the parameters of the model that best characterize the training sequences, we need to assign a score:

l(x^1, . . . , x^n | (P, e)) = log P(x^1, . . . , x^n | (P, e)) = ∑_{j=1}^{n} log P(x^j | (P, e)).

The goal is to choose parameters (P, e) so that we maximize this score, called the log likelihood:

(P, e) = argmax_{(P′,e′)} l(x^1, . . . , x^n | (P′, e′)).    (12.9)

12.13.2 Supervised learning

It was easier to calculate the probability of an observed sequence when the (generating) path is known. The same holds for the parameter estimation: it is easier to estimate the probabilities of the HMM when the paths for all training sequences are known.

Supervised learning: estimation of parameters when both input (symbols) and output (states) are provided.

An example of such supervised learning for the CpG-island case would be to have genomic sequences in which all islands are labelled.

Given a list of observed symbol sequences x1, x2, . . . , xn and a list of corresponding paths π1, π2, . . . , πn.

We want to choose the parameters (P, e) for the transition and emission probabilities of the HMM M optimally, such that:

P(x^1, . . . , x^n, π^1, . . . , π^n | M = (Σ, Q, P, e)) = max_{(P′,e′)} P(x^1, . . . , x^n, π^1, . . . , π^n | M = (Σ, Q, P′, e′)).

In other words, we want to determine the Maximum Likelihood Estimator (ML-estimator) for (P, e).

12.13.3 ML-Estimation for (P, e)

(Recall: If we consider P(D | M) as a function of D, then we call this a probability; as a function of M, we use the word likelihood.)

ML-estimation:

(P, e)_ML = argmax_{(P′,e′)} P(x^1, . . . , x^n, π^1, . . . , π^n | M = (Σ, Q, P′, e′)).

Computation: we count the number of times each particular transition or emission occurs in the training sequences:

P_kl: number of transitions from state k to l
E_k(b): number of emissions of b in state k

We obtain the ML-estimators for (P, e) by setting:

p_kl = P_kl / ∑_{q∈Q} P_kq   and   e_k(b) = E_k(b) / ∑_{s∈Σ} E_k(s).    (12.10)

12.13.4 Training the fair/loaded HMM

Given example data x and π:

Symbols x: 1 2 5 3 4 6 1 2 6 6 3 2 1 5
States pi: F F F F F F F L L L L F F F


State transitions:

  Pkl   0   F   L          pkl   0   F   L
  0                        0
  F                        F
  L                        L

Emissions:

  Ek(b)  1  2  3  4  5  6          ek(b)  1  2  3  4  5  6
  0                                0
  F                                F
  L                                L

Given example data x and π:

Symbols x: 1 2 5 3 4 6 1 2 6 6 3 2 1 5
States pi: F F F F F F F L L L L F F F

State transitions:

  Pkl   0   F   L          pkl   0     F     L
  0     0   1   0          0     0     1     0
  F     1   8   1          F     1/10  8/10  1/10
  L     0   1   3          L     0     1/4   3/4

Emissions:

  Ek(b)  1  2  3  4  5  6      →      ek(b)  1     2     3     4     5     6
  F      3  2  1  1  2  1             F      3/10  2/10  1/10  1/10  2/10  1/10
  L      0  1  1  0  0  2             L      0     1/4   1/4   0     0     1/2
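
The same counting, carried out in Python for the example data above (my own sketch; '0' denotes the begin/end state):

from collections import Counter

x  = "12534612663215"
pi = "FFFFFFFLLLLFFF"

path = "0" + pi + "0"                       # add begin and end state 0
P = Counter(zip(path, path[1:]))            # transition counts P_kl
E = Counter(zip(pi, x))                     # emission counts E_k(b)

p = {k: {l: P[(k, l)] / sum(P[(k, q)] for q in "0FL") for l in "0FL"} for k in "0FL"}
e = {k: {b: E[(k, b)] / sum(E[(k, s)] for s in "123456") for b in "123456"} for k in "FL"}

print(p["F"])   # {'0': 0.1, 'F': 0.8, 'L': 0.1}
print(e["L"])   # {'1': 0.0, '2': 0.25, '3': 0.25, '4': 0.0, '5': 0.0, '6': 0.5}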

12.13.5 Pseudocounts

One problem in training is overfitting. For example, if some possible transition k → l is never seen in the example data, then we will set p_kl = 0 and the transition is then forbidden.

If a given state k is never seen in the example data, then p_kl is undefined for all l, because numerator and denominator both have value zero.

To solve this problem, we introduce pseudocounts rkl and rk(b) and define:

Pkl = number of transitions from k to l in the example data + rkl

Ek(b) = number of emissions of b in k in the example data + rk(b).

Small pseudocounts reflect “little pre-knowledge”, large ones reflect “more pre-knowledge”.

12.13.6 Unsupervised learning

Unsupervised learning: In practice, one usually has access only to the input (symbols) and not to the output (states).

Given sequences of observed symbols x^1, x^2, . . . , x^n, for which we do NOT know the corresponding state paths π^1, . . . , π^n.

Unfortunately, the problem of choosing the parameters (P, e) of the HMM M optimally, so that

P(x^1, . . . , x^n | M = (Σ, Q, P, e)) = max_{(P′,e′)} P(x^1, . . . , x^n | M = (Σ, Q, P′, e′))

holds, is NP-complete.


12.13.7 The expected number of transitions and emissions

The idea for estimating the parameters is to use an iterative method.

Suppose we are given an HMM M and training data x^1, . . . , x^n. The probability that transition k → l is used at position i in sequence x is:

P(π_i = k, π_{i+1} = l | x, (P, e)) = P(π_i = k, π_{i+1} = l, x | (P, e)) / P(x)
  = P(π_i = k, x_1, . . . , x_i | (P, e)) · P(π_{i+1} = l, x_{i+1}, . . . , x_L | π_i = k, (P, e)) / P(x)
  = f_k(i) p_kl e_l(x_{i+1}) b_l(i + 1) / P(x).    (12.11)

We estimate the expected number of times that transition k → l is used by summing over all positions and all training sequences:

P_kl = ∑_j ( 1/P(x^j) ) ∑_i f^j_k(i) p_kl e_l(x^j_{i+1}) b^j_l(i + 1),    (12.12)

where f^j_k and b^j_l are the forward and backward variables computed for sequence x^j, respectively.

Similarly, the expected number of times that letter b appears in state k is given by:

E_k(b) = ∑_j ( 1/P(x^j) ) ∑_{i : x^j_i = b} f^j_k(i) b^j_k(i),    (12.13)

where the inner sum is only over those positions i for which the symbol emitted is b.

12.13.8 The Baum-Welch-algorithm

Let M = (Σ, Q, P, e) be an HMM and suppose that training sequences x^1, x^2, . . . , x^n are given. The parameters (P, e) are to be iteratively improved as follows:

- Using the current value of (P, e), estimate the expected number P_kl of transitions from state k to state l and the expected number E_k(b) of emissions of symbol b in state k.

- Then, set

p_kl = P_kl / ∑_{q∈Q} P_kq   and   e_k(b) = E_k(b) / ∑_{s∈Σ} E_k(s).

- This is repeated until some halting criterion is met.

This is a special case of the so-called EM-technique. (EM=expectation maximization).

Baum-Welch Algorithm
Input: HMM M = (Σ, Q, P, e), training data x^1, x^2, . . . , x^n
Output: HMM M′ = (Σ, Q, P′, e′) with an improved log-likelihood score.

Initialize:
    Set P and e to arbitrary model parameters.
Recursion:
    Set P_kl = 0, E_k(b) = 0 or to their pseudocounts.
    For every sequence x^j:
        Compute f^j, b^j and P(x^j).
        Increase P_kl by (1/P(x^j)) ∑_i f^j_k(i) p_kl e_l(x^j_{i+1}) b^j_l(i + 1).
        Increase E_k(b) by (1/P(x^j)) ∑_{i : x^j_i = b} f^j_k(i) b^j_k(i).
    Add pseudocounts to P_kl and E_k(b), if desired.
    Set p_kl = P_kl / ∑_{q∈Q} P_kq and e_k(b) = E_k(b) / ∑_{s∈Σ} E_k(s).


    Determine the new log-likelihood l(x^1, . . . , x^n | (P, e)).
End: Terminate when the score does not improve or a maximum number of iterations is reached.
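
A compact Python sketch of one Baum-Welch iteration for a two-state HMM (my own illustration; no explicit end state, a fixed uniform start distribution, no pseudocounts and made-up toy roll data are all simplifications and assumptions relative to the algorithm above):

def forward(x, S, start, T, E):
    f = [{k: start[k] * E[k][x[0]] for k in S}]
    for i in range(1, len(x)):
        f.append({l: E[l][x[i]] * sum(f[i - 1][k] * T[k][l] for k in S) for l in S})
    return f

def backward(x, S, T, E):
    b = [{k: 1.0 for k in S} for _ in x]
    for i in range(len(x) - 2, -1, -1):
        b[i] = {k: sum(T[k][l] * E[l][x[i + 1]] * b[i + 1][l] for l in S) for k in S}
    return b

def baum_welch_step(seqs, S, alphabet, start, T, E):
    """One EM update of the transition probabilities T and emission probabilities E."""
    P = {k: {l: 0.0 for l in S} for k in S}          # expected transition counts P_kl
    C = {k: {s: 0.0 for s in alphabet} for k in S}   # expected emission counts E_k(b)
    for x in seqs:
        f, b = forward(x, S, start, T, E), backward(x, S, T, E)
        px = sum(f[-1][k] for k in S)                # P(x | M)
        for i in range(len(x) - 1):
            for k in S:
                for l in S:
                    P[k][l] += f[i][k] * T[k][l] * E[l][x[i + 1]] * b[i + 1][l] / px
        for i in range(len(x)):
            for k in S:
                C[k][x[i]] += f[i][k] * b[i][k] / px
    T_new = {k: {l: P[k][l] / sum(P[k].values()) for l in S} for k in S}
    E_new = {k: {s: C[k][s] / sum(C[k].values()) for s in alphabet} for k in S}
    return T_new, E_new

S, alphabet = ["F", "L"], "123456"
start = {"F": 0.5, "L": 0.5}
T = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
E = {"F": {d: 1 / 6 for d in alphabet},
     "L": {"1": .1, "2": .1, "3": .1, "4": .1, "5": .1, "6": .5}}
seqs = ["315116246446644245311321631164", "152453166664426666665666666361"]   # toy rolls
for _ in range(10):
    T, E = baum_welch_step(seqs, S, alphabet, start, T, E)
print(T)
print(E["L"])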

Remark about Convergence:

One can prove that the log-likelihood-score converges to a local maximum using the Baum-Welch-algorithm.

However, this doesn’t imply that the parameters converge!

Local maxima can be avoided by considering many different starting points.

Additionally, any standard optimization approach can also be applied to solve the optimization problem.

12.13.9 Example of Baum-Welch training

Assume that a casino uses two dice, a fair and a loaded one. Here are the original HMM and two models obtained using the Baum-Welch algorithm, one based on a sequence of 300 and the other on 30 000 observations (see Durbin et al., p. 65-66):

[Figure: three fair/loaded HMMs with the following parameters.]

                                  Original       Baum-Welch          Baum-Welch
                                  parameters     using 300 rolls     using 30 000 rolls
Fair:   e(1) . . . e(6)           1/6 each       .19 .19 .23 .08 .23 .08    .17 .17 .17 .17 .17 .15
Unfair: e(1) . . . e(6)           .1 .1 .1 .1 .1 .5    .07 .10 .10 .17 .05 .52    .10 .11 .10 .11 .10 .48
Fair → Fair / Fair → Unfair       0.95 / 0.05    0.73 / 0.27         0.93 / 0.07
Unfair → Unfair / Unfair → Fair   0.90 / 0.10    0.71 / 0.29         0.88 / 0.12

12.13.10 Viterbi training

An alternative to the Baum-Welch training algorithm is Viterbi training. Let M = (Σ, Q, P, e) be an HMM and suppose that training sequences x^1, x^2, . . . , x^n are given. The parameters (P, e) are to be iteratively improved as follows:

Start with a guess for the initial parameters and use the Viterbi algorithm to compute the currently most probable paths π*^1, . . . , π*^n.

Iterate until π*^1, . . . , π*^n cannot be improved any more (or some other halting criterion is met):

1. Based on π*^1, . . . , π*^n, compute P_kl (the number of transitions from state k to l) and E_k(b) (the number of emissions of b in state k).

2. Update the model parameters (P, e):

p_kl = P_kl / ∑_{q∈Q} P_kq   and   e_k(b) = E_k(b) / ∑_{s∈Σ} E_k(s).

3. Compute π∗1, . . . , π∗n from updated model parameters

12.14 Profile HMMs for protein identification

Alignment of seven Globin sequences:

#A-helices ...........AAAAAAAAAAAAAAAA...BBBBBBBBBBBBBBBBCCCCCCCCCCC....DDDDDDDEEEEEEEEEEEE

GLB1_GLYDI .........GLSAAQRQVIAATWKDIAGADNGAGVGKDCLIKFLSAHPQMAAVFG.FSG....AS...DPGVAALGAKVL

HBB_HUMAN ........VHLTPEEKSAVTALWGKV....NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVL

HBA_HUMAN .........VLSPADKTNVKAAWGKVGA..HAGEYGAEALERMFLSFPTTKTYFPHF.DLS.....HGSAQVKGHGKKVA


MYG_PHYCA .........VLSEGEWQLVLHVWAKVEA..DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVL

GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVYS..TYETSGVDILVKFFTSTPAAQEFFPKFKGLTTADQLKKSADVRWHAERII

GLB3_CHITP ..........LSADQISTVQASFDKVKG......DPVGILYAVFKADPSIMAKFTQFAG.KDLESIKGTAPFETHANRIV

LGB2_LUPLU ........GALTESQAALVKSSWEEFNA..NIPKHTHRFFILVLEIAPAAKDLFS.FLK.GTSEVPQNNPELQAHAGKVF

#A-helices EEEEEEEEE............FFFFFFFFFFFF..FFGGGGGGGGGGGGGGGGGGG.....HHHHHHHHHHHHHHHHHHH

GLB1_GLYDI AQIGVAVSHL..GDEGKMVAQMKAVGVRHKGYGNKHIKAQYFEPLGASLLSAMEHRIGGKMNAAAKDAWAAAYADISGAL

HBB_HUMAN GAFSDGLAHL...D..NLKGTFATLSELHCDKL..HVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANAL

HBA_HUMAN DALTNAVAHV...D..DMPNALSALSDLHAHKL..RVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVL

MYG_PHYCA TALGAILKK....K.GHHEAELKPLAQSHATKH..KIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDI

GLB5_PETMA NAVNDAVASM..DDTEKMSMKLRDLSGKHAKSF..QVDPQYFKVLAAVIADTVAAG.........DAGFEKLMSMICILL

GLB3_CHITP GFFSKIIGEL..P...NIEADVNTFVASHKPRG...VTHDQLNNFRAGFVSYMKAHT..DFA.GAEAAWGATLDTFFGMI

LGB2_LUPLU KLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG...VADAHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVI

#A-helices HHHHHHH....

GLB1_GLYDI ISGLQS.....

HBB_HUMAN AHKYH......

HBA_HUMAN TSKYR......

MYG_PHYCA AAKYKELGYQG

GLB5_PETMA RSAY.......

GLB3_CHITP FSKM.......

LGB2_LUPLU KKEMNDAA...

Question: How can this family be characterized?

12.14.1 Characterization?

How can one characterize a family of protein sequences?

• Exemplary sequence? E.g. the center star sequence?

• Consensus sequence?

• Regular expression (Prosite):

GLB1_GLYDI ...IAGADNGAGV...
LGB2_LUPLU ...FNA--NIPKH...

...[FI]-[AN]-[AG]-x(1,2)-N-[IG]-[AP]-[GK]-[HV]...

• HMM from the profile?

12.14.2 Profile HMMs

Definition 12.14.1 A profile HMM is a probabilistic representation of a multiple alignment. A given multiple alignment (of a protein family) is used to build a profile HMM.

This model may then be used to find and score less obvious potential matches of new protein sequences.

12.14.3 Gapless HMM

We notice that a general feature of protein family multiple sequence alignments is that gaps tend to line up with each other, and that there are blocks that are completely gapless. We will therefore first describe an HMM for these gap-free regions. Consider the following block from the alignment of the seven globin sequences:

GLB1_GLYDI ...DPGVAALGAKVL...
HBB_HUMAN  ...NPKVKAHGKKVL...
HBA_HUMAN  ...SAQVKGHGKKVA...
MYG_PHYCA  ...SEDLKKHGVTVL...
GLB5_PETMA ...SADVRWHAERII...
GLB3_CHITP ...TAPFETHANRIV...
LGB2_LUPLU ...NPELQAHAGKVF...


In order to specify an HMM, we need to determine the 4 elements of the model M :

Here, clearly, the emission alphabet is the alphabet of the amino acids. The states consist of only one type, the so-called match states, and the transition probability going from match state M_i to M_{i+1} is equal to 1. We will later see how to generalize this. The emission probability e_i(a) is then the probability of observing amino acid a at position i. The corresponding graph for the block above looks as follows:

[Figure: a chain of match states, one per column of the block; each state box lists the amino acids observed in that column (e.g. D/N/S/T, A/E/P, D/E/G/K/P/Q, F/L/V, A/E/K/Q/R, A/G/K/T/W, H/L, A/G, A/E/G/K/N/V).]

According to this model, the probability of a new sequence x given the model M is

P(x | M) = ∏_{i=1}^{L} e_i(x_i)

where L is equal to the length of the block considered. As usual, we actually want to evaluate the log-odds ratio

S(x) = ∑_{i=1}^{L} log ( e_i(x_i) / q_{x_i} )

where q_a is the probability of observing amino acid a under a random model. We notice that this is an approach similar to the one we considered in chapter 6.13.1, where we defined profiles of multiple sequence alignments. Thus this trivial HMM is equivalent to a PSSM (Position Specific Score Matrix).
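
A sketch of this trivial HMM / PSSM in Python (my own illustration), built from the gapless block of the seven globin sequences shown above; the uniform background q_a = 1/20 and the pseudocount of 1 per amino acid are assumptions of this sketch:

import math
from collections import Counter

block = ["DPGVAALGAKVL", "NPKVKAHGKKVL", "SAQVKGHGKKVA", "SEDLKKHGVTVL",
         "SADVRWHAERII", "TAPFETHANRIV", "NPELQAHAGKVF"]
amino_acids = "ACDEFGHIKLMNPQRSTVWY"

# Position-specific emission probabilities e_i(a), with a pseudocount of 1 per amino acid.
columns = [Counter(col) for col in zip(*block)]
e = [{a: (col[a] + 1) / (sum(col.values()) + len(amino_acids)) for a in amino_acids}
     for col in columns]
q = 1 / len(amino_acids)                     # uniform background model (assumption)

def score(x):
    """Log-odds score S(x) = sum_i log2( e_i(x_i) / q_{x_i} ) for a gapless sequence x."""
    return sum(math.log2(e[i][a] / q) for i, a in enumerate(x))

print(score("NPKVKAHGKKVL"))   # a member of the block scores high
print(score("WYYWMWCWMYWC"))   # an unrelated sequence scores low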

12.14.4 Insert-states

In order to capture not just the information in the gap-free regions of an msa, we need to take account of gaps. A first idea would be to allow gaps at each position using the same gap score γ(g) at each position. However, as we have seen from the alignment, this does not take into account the fact that gaps are position dependent. Thus we need to model position-dependent gap scores, just as we defined position-dependent substitution scores.

Again we have the match states M_j, with emission probabilities now denoted by e_{M_j}(a), and transition probabilities p_{M_j M_{j+1}} between consecutive match states.

Next we introduce so-called insert states I_j that model insertions after position j. The insert states emit symbols with probability e_{I_j}(a), which we set equal to the background probabilities q_a.

Then there is a transition probability from a match state to an insert state, a loop transition from an insert state to itself, as well as a transition probability from an insert state back to the next match state. The corresponding graph looks as follows:

[Figure: match states M_j between Begin and End, with an insert state I_j above M_j; transitions M_j → I_j, a self-loop I_j → I_j, and I_j → M_{j+1}.]

The (log) score of a gap of length k is then

log p_{M_j I_j} + log p_{I_j M_{j+1}} + (k − 1) log p_{I_j I_j}


12.14.5 Delete-states

Finally, we also need to take into account deletions in a sequence. We introduce so-called delete states D_j, which are silent states (similar to the begin and end states), since they do not emit any symbols. There can be a series of consecutive delete states, which allows us, for example, to model the absence of individual domains. Again we have to consider transition probabilities p_MD, p_DD and p_DM.

[Figure: match states M_j between Begin and End, with a silent delete state D_j above each M_j and transitions M → D, D → D and D → M.]

12.14.6 Topology of a profile-HMM

[Figure: the full profile-HMM topology from Begin to End, with a match state, an insert state and a delete state at every position. Legend: Match-state, Insert-state, Delete-state.]

12.14.7 Design of a profile-HMM

Given a multiple alignment of a family of sequences.

First we must decide which positions are to be modeled as match states, which as delete states and which as insert states. Rule of thumb: columns with more than 50% gaps should be modeled as insert states.

Next we determine the transition and emission probabilities. Since an msa is given, we can simply count the observed transitions P_kl and emissions E_k(b) and then calculate:

p_kl = P_kl / ∑_{l′} P_{kl′}   and   e_k(b) = E_k(b) / ∑_{b′} E_k(b′).

Obviously, it may happen that certain transitions or emissions do not appear in the training data, and thus we use the Laplace rule and add 1 to each count.

12.14.8 Aligning a sequence to a profile HMM

A variant of the Viterbi algorithm can be used to align a sequence X of length m with a profile HMM of length L.


Let V^M_j(i) be the log-odds score of the best path matching the subsequence X[1 . . . i] to the submodel up to state j, ending with X_i being emitted by match state M_j. Similarly, we define V^I_j(i) as the log-odds score of the best path ending with X_i being emitted by insert state I_j, and V^D_j(i) as the log-odds score of the best path ending in delete state D_j after X_1 . . . X_i have been processed (delete states emit nothing). These scores can be computed via the following recursion formulas:

V^M_j(i) = log( e_{M_j}(X_i) / q_{X_i} ) + max { V^M_{j−1}(i−1) + log p_{M_{j−1} M_j},
                                                 V^I_{j−1}(i−1) + log p_{I_{j−1} M_j},
                                                 V^D_{j−1}(i−1) + log p_{D_{j−1} M_j} }    (12.14)

V^I_j(i) = log( e_{I_j}(X_i) / q_{X_i} ) + max { V^M_j(i−1) + log p_{M_j I_j},
                                                 V^I_j(i−1) + log p_{I_j I_j},
                                                 V^D_j(i−1) + log p_{D_j I_j} }    (12.15)

V^D_j(i) = max { V^M_{j−1}(i) + log p_{M_{j−1} D_j},
                 V^I_{j−1}(i) + log p_{I_{j−1} D_j},
                 V^D_{j−1}(i) + log p_{D_{j−1} D_j} }    (12.16)

12.14.9 Aligning a sequence to a profile HMM

Typically there is no emission score term e_{I_j}(X_i) in V^I_j(i), since the insert emission probabilities are usually set equal to the background probabilities (so the log ratio is zero).

Also in reality the transition terms D → I and I → D may not be present.

Care must be taken over initialisation and termination.

When we compare these recursion formulas, we realize that this is in principle identical to the algorithm for affine gap alignments, extended by position dependence.

We have just noticed the close relationship between the Viterbi variant for aligning a sequence to a profile HMM and the global dynamic programming algorithm with affine gap functions. Similarly, although with more care, an algorithm can be formulated for local alignments of a sequence to a profile HMM.

For a detailed presentation the reader is referred to Durbin et al., chapters 5.4-5.8.

12.14.10 Practice of building Profile HMMs for proteins

• Use Blast to separate a protein database into families of related proteins.

• Construct a multiple alignment for each protein family.

• Construct a profile HMM and optimize the parameters of the model (transition and emission probabilities).

• Align the target sequence against each HMM to find the best fit between a target sequence and an HMM.

12.14.11 Application of Profile HMM to Modeling Globin Proteins

Globins represent a large collection of protein sequences.

• 400 globin sequences were randomly selected from all globins and used to construct a multiple alignment.

• Multiple alignment was used to assign an initial HMM.

• This model was then trained repeatedly, with model lengths chosen randomly between 145 and 170, to obtain an HMM with optimized probabilities.


How Good is the Globin HMM?

The 625 remaining globin sequences in the database were aligned to the constructed HMM, resulting in a multiple alignment. This multiple alignment agrees extremely well with the structurally derived alignment.

25,044 proteins were randomly chosen from the database and compared against the globin HMM. This experiment resulted in an excellent separation between globin and non-globin families.

12.14.12 Profile HMM software

• SAM: Sequence Alignment and Modeling System (http://www.cse.ucsc.edu/research/compbio/sam.html)

• HMMER: profile HMMs for protein sequence analysis (http://hmmer.wustl.edu/)

• Meta-MEME: a software toolkit for building and using motif-based hidden Markov models of DNA and proteins. The input to Meta-MEME is a set of similar protein sequences, as well as a set of motif models discovered by MEME. Meta-MEME combines these models into a single, motif-based hidden Markov model and uses this model to search a sequence database for homologs. (http://metameme.sdsc.edu/)

Also recommended for reading:

Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14(9):755-63.