Hidden Markov Models (I)
lliao/globex15/lec6.pdf · GLOBEX Bioinformatics (Summer 2015)

Page 1:

GLOBEX Bioinformatics (Summer 2015)

Hidden Markov Models (I)

a. The model

b. The decoding: Viterbi algorithm

Page 2:

Hidden Markov models

• A Markov chain of states.

• At each state, there is a set of possible observables (symbols).

• The states are not directly observable; that is, they are hidden.

• E.g., casino fraud.

• Three major problems:

– Most probable state path

– The likelihood

– Parameter estimation for HMMs

Figure: the casino HMM. The Fair state emits 1-6 with probability 1/6 each; the Loaded state emits 1-5 with probability 1/10 each and 6 with probability 1/2. Transitions: Fair→Fair 0.95, Fair→Loaded 0.05, Loaded→Loaded 0.9, Loaded→Fair 0.1.

Page 3:

A biological example: CpG islands

• The higher rate of methyl-C mutating to T in CpG dinucleotides leads to a generally lower CpG presence in the genome, except in some biologically important regions, e.g., promoters; these regions are called CpG islands.

• The conditional probabilities P±(N|N′) are collected from ~60,000 bp of human genome sequence, where + stands for CpG islands and − for non-CpG islands.

P−     A      C      G      T
A    .300   .205   .285   .210
C    .322   .298   .078   .302
G    .248   .246   .298   .208
T    .177   .239   .292   .292

P+     A      C      G      T
A    .180   .274   .426   .120
C    .171   .368   .274   .188
G    .161   .339   .375   .125
T    .079   .355   .384   .182

(Rows give the current nucleotide, columns the next; each entry is P±(next | current).)

Page 4:

Task 1: given a sequence x, determine if it is a CpG island.

One solution: compute the log-odds ratio scored by the two Markov chains:

S(x) = log [ P(x | model +) / P(x | model −) ]

where P(x | model +) = P+(x_2|x_1) P+(x_3|x_2) … P+(x_L|x_{L−1}) and

P(x | model −) = P−(x_2|x_1) P−(x_3|x_2) … P−(x_L|x_{L−1})

Figure: histogram of the length-normalized scores S(x) (y-axis: number of sequences); CpG-island sequences are shown dark-shaded.

Page 5:

Task 2: For a long genomic sequence x, label these CpG islands, if there are any.

Approach 1: Adopt the method for Task 1 by calculating the log-odds score for a window of, say, 100 bps around every nucleotide and plotting it.

Problems with this approach:

– It won't do well if CpG islands have sharp boundaries and variable lengths.

– There is no effective way to choose a good window size.

Page 6:

Approach 2: use a hidden Markov model.

Figure: a two-state HMM. State "+" emits A: .170, C: .368, G: .274, T: .188; state "−" emits A: .372, C: .198, G: .112, T: .338. Transitions: a++ = 0.65, a+− = 0.35, a−+ = 0.30, a−− = 0.70.

• The model has two states, "+" for CpG island and "−" for non-CpG island. The numbers here are made up and would be fixed by learning from training examples.

• Notation: a_kl is the transition probability from state k to state l; e_k(b) is the emission probability, i.e., the probability that symbol b is seen when in state k.

Page 7:

The probability that sequence x is emitted by a state path π is:

P(x, π) = ∏_{i=1..L} e_{π_i}(x_i) a_{π_i π_{i+1}}

i: 123456789
x: TGCGCGTAC
π: --++++---

P(x, π) = 0.338 × 0.70 × 0.112 × 0.30 × 0.368 × 0.65 × 0.274 × 0.65 × 0.368 × 0.65 × 0.274 × 0.35 × 0.338 × 0.70 × 0.372 × 0.70 × 0.198.

Then, the probability of observing sequence x under the model is

P(x) = Σ_π P(x, π),

which is also called the likelihood of the model.

Figure: the two-state "+"/"−" model from Page 6, with the same emission and transition probabilities.

Page 8:

Decoding: given an observed sequence x, what is the most probable state path? I.e.,

π* = argmax_π P(x, π)

Q: Given a sequence x of length L, how many state paths are there?

A: N^L, where N is the number of states in the model.

As an exponential function of the input size, this precludes enumerating all possible state paths to compute P(x).

Page 9:

Let v_k(i) be the probability of the most probable path ending at position i in state k.

Viterbi Algorithm

Initialization: v_0(0) = 1; v_k(0) = 0 for k > 0.

Recursion: v_k(i) = e_k(x_i) max_j (v_j(i−1) a_jk);

ptr_i(k) = argmax_j (v_j(i−1) a_jk).

Termination: P(x, π*) = max_k (v_k(L) a_k0);

π*_L = argmax_k (v_k(L) a_k0).

Traceback: π*_{i−1} = ptr_i(π*_i).

Page 10:

Figure: the two-state "+"/"−" model from Page 6, extended with a silent start state ε that enters "+" and "−" with probability 0.5 each.

Viterbi table v_k(i) for x = AAGCG:

k \ i    ε      A      A       G       C        G
0        1      0      0       0       0        0
+        0    .085   .0095   .0039   .00093   .00038
−        0    .186   .048    .0038   .00053   .000042

v_k(i) = e_k(x_i) max_j (v_j(i−1) a_jk)

Traceback yields the hidden state path − − + + +.

Page 11:

Casino Fraud: investigation results by Viterbi decoding

Page 12:

• The log transformation for the Viterbi algorithm

The recursion v_k(i) = e_k(x_i) max_j (v_j(i−1) a_jk), rewritten with ã_jk = log a_jk, ẽ_k(x_i) = log e_k(x_i), and ṽ_k(i) = log v_k(i), becomes:

ṽ_k(i) = ẽ_k(x_i) + max_j (ṽ_j(i−1) + ã_jk)

Page 13:

GLOBEX Bioinformatics (Summer 2015)

Hidden Markov Models (II)

• The model likelihood: forward algorithm, backward algorithm

• Posterior decoding

Page 14:

The probability that sequence x is emitted by a state path π is:

P(x, π) = ∏_{i=1..L} e_{π_i}(x_i) a_{π_i π_{i+1}}

i: 123456789
x: TGCGCGTAC
π: --++++---

P(x, π) = 0.338 × 0.70 × 0.112 × 0.30 × 0.368 × 0.65 × 0.274 × 0.65 × 0.368 × 0.65 × 0.274 × 0.35 × 0.338 × 0.70 × 0.372 × 0.70 × 0.198.

Then, the probability of observing sequence x under the model is

P(x) = Σ_π P(x, π),

which is also called the likelihood of the model.

Figure: the two-state "+"/"−" model from Page 6, with the same emission and transition probabilities.

Page 15:

How do we calculate the probability of observing sequence x under the model?

P(x) = Σ_π P(x, π)

Let f_k(i) be the probability contributed by all paths from the beginning up to (and including) position i, with the state at position i being k. Then the following recurrence holds:

f_k(i) = [ Σ_j f_j(i−1) a_jk ] e_k(x_i)

Graphically: a trellis with states 1, …, N at each position x_1, x_2, x_3, …, x_L, and edges from every state at one position to every state at the next. Again, a silent state 0 is introduced for cleaner presentation.

Page 16:

Forward algorithm

Initialization: f_0(0) = 1; f_k(0) = 0 for k > 0.

Recursion: f_k(i) = e_k(x_i) Σ_j f_j(i−1) a_jk.

Termination: P(x) = Σ_k f_k(L) a_k0.

Time complexity: O(N²L), where N is the number of states and L is the sequence length.

Page 17:

Let b_k(i) be the probability contributed by all paths that pass through state k at position i:

b_k(i) = P(x_{i+1}, …, x_L | π_i = k)

Backward algorithm

Initialization: b_k(L) = a_k0 for all k.

Recursion (i = L−1, …, 1): b_k(i) = Σ_j a_kj e_j(x_{i+1}) b_j(i+1).

Termination: P(x) = Σ_k a_0k e_k(x_1) b_k(1).

Time complexity: O(N²L), where N is the number of states and L is the sequence length.

Page 18:

Posterior decoding

P(π_i = k | x) = P(x, π_i = k) / P(x) = f_k(i) b_k(i) / P(x)

Algorithm:

for i = 1 to L:
    π̂_i = argmax_k P(π_i = k | x)

Notes:

1. Posterior decoding may be useful when there are multiple, almost equally probable state paths, or when a function is defined on the states.

2. The state path identified by posterior decoding may not be the most probable path overall, and may not even be a viable path.

Page 19:

GLOBEX Bioinformatics (Summer 2015)

Hidden Markov Models (III)

- Viterbi training
- Baum-Welch algorithm
- Maximum Likelihood
- Expectation Maximization

Page 20:

Model building

- Topology: requires domain knowledge.

- Parameters:

  - When states are labeled for the sequences of observables, simple counting:

    a_kl = A_kl / Σ_{l′} A_{kl′}  and  e_k(b) = E_k(b) / Σ_{b′} E_k(b′)

  - When states are not labeled:

    Method 1 (Viterbi training):
    1. Assign random parameters.
    2. Use the Viterbi algorithm for labeling/decoding.
    3. Do counting to collect new a_kl and e_k(b).
    4. Repeat steps 2 and 3 until a stopping criterion is met.

    Method 2 (Baum-Welch algorithm): next page.

Page 21:

Baum-Welch algorithm (Expectation-Maximization)

• An iterative procedure similar to Viterbi training.

• The probability that the transition a_kl is used at position i of sequence x^j:

P(π_i = k, π_{i+1} = l | x^j, θ) = f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1) / P(x^j)

• Calculate the expected number of times a_kl is used by summing over all positions and all training sequences:

A_kl = Σ_j (1 / P(x^j)) [ Σ_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1) ]

• Similarly, calculate the expected number of times that symbol b is emitted in state k:

E_k(b) = Σ_j (1 / P(x^j)) [ Σ_{i : x^j_i = b} f_k^j(i) b_k^j(i) ]

Page 22:

Maximum Likelihood

Define L(θ) = P(x | θ).

Estimate θ such that the distribution with the estimated θ best agrees with, or supports, the data observed so far:

θ_ML = argmax_θ L(θ)

E.g., there are red and black balls in a box. What is the probability p of picking a black ball? Do sampling (with replacement).

Page 23:

Maximum Likelihood

Define L(θ) = P(x | θ).

Estimate θ such that the distribution with the estimated θ best agrees with, or supports, the data observed so far:

θ_ML = argmax_θ L(θ)

When L(θ) is differentiable, ∂L/∂θ |_{θ_ML} = 0.

For example, we want to know the ratio of black balls to white balls in the box; in other words, the probability p of picking a black ball. Sampling (with replacement) 100 times gives counts: white ball 91, black ball 9.

Prob(sample, i.i.d.) = p^9 (1−p)^91, so the likelihood is L(p) = p^9 (1−p)^91.

∂L/∂p = 9 p^8 (1−p)^91 − 91 p^9 (1−p)^90 = 0  ⟹  p_ML = 9/100 = 9%.

The ML estimate of p is just the observed frequency.

Page 24:

A proof that the observed frequencies give the ML estimate of the probabilities for a multinomial distribution.

Let n_i be the count for outcome i, and let θ_i^ML = n_i / N be the observed frequencies, where N = Σ_i n_i.

Claim: if θ_i^ML = n_i / N, then P(n | θ^ML) ≥ P(n | θ) for any θ ≠ θ^ML.

Proof:

log [ P(n | θ^ML) / P(n | θ) ] = log ∏_i (θ_i^ML / θ_i)^{n_i}
                               = Σ_i n_i log (θ_i^ML / θ_i)
                               = N Σ_i θ_i^ML log (θ_i^ML / θ_i)
                               = N · H(θ^ML || θ) ≥ 0,

since the relative entropy H(θ^ML || θ) is non-negative.

Page 25:

Maximum Likelihood: pros and cons

- Consistent: in the limit of a large amount of data, the ML estimate converges to the true parameters by which the data were created.

- Simple.

- Poor estimate when data are insufficient; e.g., if you roll a die fewer than 6 times, the ML estimate for some numbers will be zero.

Pseudocounts: θ_i = (n_i + α_i) / (N + A), where A = Σ_i α_i.

Page 26:

Conditional Probability and Joint Probability

P(one) = 5/13
P(square) = 8/13
P(one, square) = 3/13
P(one | square) = 3/8 = P(one, square) / P(square)

In general, P(D, M) = P(D | M) P(M) = P(M | D) P(D)

=> Bayes' rule: P(M | D) = P(D | M) P(M) / P(D)

Page 27:

Conditional Probability and Conditional Independence

Page 28:

Bayes' rule: P(M | D) = P(D | M) P(M) / P(D)

Example: disease diagnosis/inference.

P(Leukemia | Fever) = ?
P(Fever | Leukemia) = 0.85
P(Fever) = 0.9
P(Leukemia) = 0.005

P(Leukemia | Fever) = P(F | L) P(L) / P(F) = 0.85 × 0.005 / 0.9 ≈ 0.0047

Page 29:

Bayesian Inference

Maximum a posteriori estimate:

θ_MAP = argmax_θ P(θ | x)

Page 30:

Expectation Maximization

We want to maximize log P(x | θ), where the data x depend on a hidden variable y.

Since P(x, y | θ) = P(y | x, θ) P(x | θ), we have P(x | θ) = P(x, y | θ) / P(y | x, θ), and hence

log P(x | θ) = Σ_y P(y | x, θ^t) log P(x, y | θ) − Σ_y P(y | x, θ^t) log P(y | x, θ)

Define Q(θ | θ^t) = Σ_y P(y | x, θ^t) log P(x, y | θ).   (Expectation)

Then

log P(x | θ) − log P(x | θ^t)
= Q(θ | θ^t) − Q(θ^t | θ^t) + Σ_y P(y | x, θ^t) log [ P(y | x, θ^t) / P(y | x, θ) ]
≥ Q(θ | θ^t) − Q(θ^t | θ^t),

since the last sum is a relative entropy and therefore non-negative.

θ^{t+1} = argmax_θ Q(θ | θ^t)   (Maximization)

Page 31:

EM explanation of the Baum-Welch algorithm

We would like to maximize P(x | θ) = Σ_π P(x, π | θ) by choosing θ, but the state path π is a hidden variable; thus, EM:

Q(θ | θ^t) = Σ_π P(π | x, θ^t) log P(x, π | θ)

Page 32:

EM Explanation of the Baum-Welch algorithm

Q(θ | θ^t) splits into an A-term (transition counts) and an E-term (emission counts).

The A-term is maximized if  a_kl^EM = A_kl / Σ_{l′} A_{kl′}

The E-term is maximized if  e_k^EM(b) = E_k(b) / Σ_{b′} E_k(b′)

Page 33:

GLOBEX Bioinformatics (Summer 2015)

Hidden Markov Models (IV)

a. Profile HMMs
b. ipHMMs
c. GENSCAN
d. TMMOD

Pages 34-45: figure-only slides (presumably the profile HMM material from the outline); no text content was recovered.

Page 46:

Interaction profile HMM (ipHMM)

Friedrich et al., Bioinformatics 2006

Page 47:

GENSCAN (generalized HMMs)

• Chris Burge, PhD thesis '97, Stanford

• http://genes.mit.edu/GENSCAN.html

• Four components:

– A vector π of initial probabilities

– A matrix T of state transition probabilities

– A set of length distributions f

– A set of sequence-generating models P

• Generalized HMMs:

– At each state, the emission is not a single symbol (or residue) but a fragment of sequence.

– Modified Viterbi algorithm.

Page 48: (figure-only slide; no text content recovered)

Page 49:

• Initial state probabilities

– The frequency with which each functional unit occurs in actual genomic data; e.g., since ~80% of the genome is non-coding intergenic region, the initial probability for state N is 0.80.

• Transition probabilities

• State length distributions
Page 50:

• Training data

– 2.5 Mb of human genomic sequence

– 380 genes, 142 single-exon genes, 1492 exons, and 1254 introns

– 1619 cDNAs

Page 51:

Open areas for research

• Model building

– Integration of domain knowledge, such as structural information, into profile HMMs

– Meta learning?

• Biological mechanisms, e.g., DNA replication

• Hybrid models

– Generalized HMMs

– …

Page 52:

TMMOD: An improved hidden Markov model for predicting transmembrane topology

Page 53:

Figure: the TMHMM architecture. The cytoplasmic side has globular and loop states; the membrane has cap and helix-core states; the non-cytoplasmic side has cap, loop, long-loop, and globular states.

TMHMM by Krogh, A. et al., JMB 305 (2001) 567-580.

Accuracy of prediction for topology: 78%.

Page 54:

Figure: histogram of loop length (x-axis: length, 0-100; y-axis: number of loops, 0-25) for cytoplasmic (cyto) and non-cytoplasmic (acyto) loops.

Page 55: (figure-only slide; no text content recovered)

Page 56:

Mod.     Reg.  Data set  Correct topology  Correct location  Sensitivity  Specificity

TMMOD 1  (a)   S-83      65 (78.3%)        67 (80.7%)        97.4%        97.4%
TMMOD 1  (b)   S-83      51 (61.4%)        52 (62.7%)        71.3%        71.3%
TMMOD 1  (c)   S-83      64 (77.1%)        65 (78.3%)        97.1%        97.1%
TMMOD 2  (a)   S-83      61 (73.5%)        65 (78.3%)        99.4%        97.4%
TMMOD 2  (b)   S-83      54 (65.1%)        61 (73.5%)        93.8%        71.3%
TMMOD 2  (c)   S-83      54 (65.1%)        66 (79.5%)        99.7%        97.1%
TMMOD 3  (a)   S-83      70 (84.3%)        71 (85.5%)        98.2%        97.4%
TMMOD 3  (b)   S-83      64 (77.1%)        65 (78.3%)        95.3%        71.3%
TMMOD 3  (c)   S-83      74 (89.2%)        74 (89.2%)        99.1%        97.1%
TMHMM          S-83      64 (77.1%)        69 (83.1%)        96.2%        96.2%
PHDtm          S-83      (85.5%)           (88.0%)           98.8%        95.2%
TMMOD 1  (a)   S-160     117 (73.1%)       128 (80.0%)       97.4%        97.0%
TMMOD 1  (b)   S-160     92 (57.5%)        103 (64.4%)       77.4%        80.8%
TMMOD 1  (c)   S-160     117 (73.1%)       126 (78.8%)       96.1%        96.7%
TMMOD 2  (a)   S-160     120 (75.0%)       132 (82.5%)       98.4%        97.2%
TMMOD 2  (b)   S-160     97 (60.6%)        121 (75.6%)       97.7%        95.6%
TMMOD 2  (c)   S-160     118 (73.8%)       135 (84.4%)       98.4%        97.2%
TMMOD 3  (a)   S-160     120 (75.0%)       133 (83.1%)       97.8%        97.6%
TMMOD 3  (b)   S-160     110 (68.8%)       124 (77.5%)       94.5%        98.1%
TMMOD 3  (c)   S-160     135 (84.4%)       143 (89.4%)       98.3%        98.1%
TMHMM          S-160     123 (76.9%)       134 (83.8%)       97.1%        97.7%