Date posted: 21-Dec-2015

Efficient Estimation of Emission Probabilities

in profile HMM

By Virpi Ahola et al

Reviewed By

Alok Datar

Index

1. Motivation

2. Introduction

3. Methods

4. Simulation Results

5. Conclusion

1. Motivation

A drawback of HMMs is that conserved amino acids are not emphasized.

Signal and noise are treated equally, hence the number of estimated parameters is enormous.

We need to focus on the conserved amino acids only, to improve accuracy.

2. Introduction

Profile HMMs originate from profile analysis.

The essence of profile analysis is that information about the conservation of the residues is incorporated into the profile, so that the analysis can detect structural similarities and homologies to the sequence family.

In HMMs, the emission probabilities of all 20 amino acids are estimated in all emitting states, and thus the number of estimated parameters can be enormous.

For example, if the model includes 300 emitting states, the number of emission parameters is 5700.
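
The figure of 5700 is consistent with each emitting state having 20 emission probabilities that sum to one, which leaves 19 free parameters per state. A minimal check of that arithmetic, assuming this is how the count was obtained (the function name is illustrative, not from the paper):

```python
def free_emission_parameters(n_emitting_states: int, alphabet_size: int = 20) -> int:
    """Count free emission parameters, assuming each state's probabilities sum to one."""
    # One probability per state is fixed by the sum-to-one constraint,
    # so only (alphabet_size - 1) values per state are free.
    return n_emitting_states * (alphabet_size - 1)


print(free_emission_parameters(300))  # prints 5700 (300 * 19)
```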

Most of these parameters, however, are noise, i.e. they correspond to unconserved residues.

This paper presents an alternative, likelihood-based approach to the problem of reducing the parameter space in HMMs.

The advantage of the new method is that it explicitly takes into account conservation of the alignment.

3. Methods

3.1 Profile HMM

3.2 Classification Algorithm

3.3 EEP Estimation Method

3.1 Profile HMM

The profile HMM architecture (Durbin et al., 1998) has three classes of states: match states, insert states and delete states.

The match and the insert states always emit a symbol.

Delete states are silent.

The model starts from the begin state and ends with the end state.

The model length is determined by the number of positions, that is, the number of match–insert–delete state triplets between the begin and the end states.

An observation sequence {Y_i} is considered to be a stochastic process with a finite set of symbols O = {o_1, o_2, ..., o_S}.

The state sequence, the path that goes through the model, is a finite-state Markov chain {X_i}. The emitted symbols are assumed to be conditionally independent given the states.

When the estimation is based on the sequence alignment, the columns of the multiple alignment are assigned as match or insert states before the estimation.

Thus, the path that generates the sequence is known.

Columns representing conserved positions are chosen as match states, and the remaining columns as insert states.
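
The slides do not say how "conserved" columns are identified. A common heuristic, used here purely as an assumption, is to treat a column as a match state when fewer than half of its entries are gaps. The function name, the `max_gap_fraction` parameter and the toy alignment are all hypothetical:

```python
def assign_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Label each alignment column 'M' (match) or 'I' (insert).

    Assumption: a column counts as a match state when its gap fraction
    is below max_gap_fraction. The paper may use a different rule.
    """
    n_seqs = len(alignment)
    labels = []
    # zip(*alignment) iterates over columns of the equal-length rows.
    for column in zip(*alignment):
        gap_fraction = column.count(gap) / n_seqs
        labels.append("M" if gap_fraction < max_gap_fraction else "I")
    return "".join(labels)


toy_alignment = ["AC-GT", "AC-GA", "ACTG-", "A--GT"]
print(assign_columns(toy_alignment))  # prints MMIMM
```

Once every column is labelled, the path of each aligned sequence through the model is fixed, which is why the estimation no longer has to sum over hidden paths.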

The profile HMM has two sets of parameters: transition probabilities and emission probabilities.

3.2 Classification Algorithm

The basis of the EEP method is that, in match states, the emission probability distributions are concentrated on some conserved residues.

Other residues occur relatively seldom.

In practice, the determination of conserved residues is variable.

In the classification algorithm, at each iteration step, the residue with the largest relative frequency with respect to its background probability was classified as effective or ineffective depending on a fixed threshold value.

The remaining probabilities were then updated so that they again summed to one.

The renormalizing step is necessary because otherwise those residues with low background probability tend to be chosen as effective more often than those with high background probability.
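
The algorithm listing itself is not in the transcript, so the following is only a sketch of the iteration as described in the text. It assumes that both the observed frequencies and the background probabilities are renormalized over the residues still unclassified, and that the loop stops at the first residue whose ratio does not exceed the threshold. The function name and the four-letter toy alphabet are hypothetical:

```python
def classify_residues(freqs, background, threshold):
    """Split residues into effective and ineffective ones.

    Sketch under stated assumptions; not the authors' exact algorithm.
    freqs and background map residue -> probability (background > 0).
    """
    remaining = set(freqs)
    effective = []
    while remaining:
        f_total = sum(freqs[r] for r in remaining)
        q_total = sum(background[r] for r in remaining)
        if f_total == 0:
            break  # nothing observed among the remaining residues
        # Relative frequency w.r.t. background, both renormalized to sum to one.
        ratios = {
            r: (freqs[r] / f_total) / (background[r] / q_total) for r in remaining
        }
        best = max(ratios, key=ratios.get)
        if ratios[best] <= threshold:
            break  # the best candidate is ineffective, so the rest are too
        effective.append(best)
        remaining.remove(best)
    return effective, sorted(remaining)


freqs = {"A": 0.5, "G": 0.3, "C": 0.1, "T": 0.1}
background = {r: 0.25 for r in freqs}
print(classify_residues(freqs, background, threshold=1.5))
```

In this toy case G has a ratio of only 1.2 before A is removed, but 1.8 after renormalization, so it is classified as effective only because the remaining probabilities are rescaled at each step.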

3.3 EEP Estimation Method

The EEP method is constructed from the log-likelihood function of the multinomial distribution, where n_j is the frequency of amino acid j.
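
The equation on this slide did not survive in the transcript. Assuming the standard multinomial form over the S symbols, with p_j the emission probability of amino acid j (the symbols \ell and p_j are notation introduced here, not taken from the slides), the log-likelihood up to an additive constant would read:

```latex
\ell(p_1, \dots, p_S) = \sum_{j=1}^{S} n_j \log p_j ,
\qquad \sum_{j=1}^{S} p_j = 1 .
```

Without any further constraints, this is maximized by the usual ML estimate \hat{p}_j = n_j / \sum_{k} n_k.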

The log-likelihood function is subject to two constraints, which are defined in terms of constants.

The first constraint ensures that the mutual ratios of the ineffective residues remain the same as in the background distribution.

The second constraint ensures that the total proportion of effective residues to ineffective residues does not increase too much with respect to the background distribution.

There are two possible sets of solutions, depending on the above-mentioned inequality.

If the inequality holds, the rescaled optimal probabilities are calculated as follows.

The probabilities of the ineffective residues are estimated by dividing the remaining probability among them in proportion to their background probabilities.
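
The slide's formulas are missing, so this is a sketch of one reading of the text rather than the authors' equations. It assumes the effective residues keep their relative frequencies n_j / N, and that the leftover probability is shared among the ineffective residues in proportion to the background. This is what maximizing the multinomial log-likelihood gives when only the ratio constraint on ineffective residues is imposed. The function name and toy data are hypothetical:

```python
def eep_estimate(counts, background, effective):
    """Estimate emission probabilities for one match state.

    Assumptions (not confirmed by the slides): effective residues get
    their relative frequency; ineffective residues split the remaining
    probability in proportion to their background probabilities.
    """
    total = sum(counts.values())
    probs = {r: counts[r] / total for r in effective}
    leftover = 1.0 - sum(probs.values())
    q_ineffective = sum(q for r, q in background.items() if r not in effective)
    for r, q in background.items():
        if r not in effective:
            probs[r] = leftover * q / q_ineffective
    return probs


counts = {"A": 50, "G": 30, "C": 15, "T": 5}
background = {r: 0.25 for r in counts}
print(eep_estimate(counts, background, effective={"A", "G"}))
```

With a uniform background, C and T each receive half of the leftover 0.2, regardless of their individual counts of 15 and 5.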

If the inequality does not hold, the probabilities are given by a second set of equations.

4. Simulation Results

In order to study how successfully the EEP method classifies the residues as effective or ineffective, the percentages of misclassified residues were calculated.

The accuracy and variance of the EEP estimates were compared to the ML estimates.

Finally, the robustness of the EEP method to the choice of the threshold value was examined.

The theoretical simulation set was composed of three effective residues: alanine (35%), glycine (50%), and methionine (10%).

The other residues were ineffective, and their probabilities were assigned by sharing the remaining probability in the same proportion as their background probabilities.
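
The construction of this simulation distribution can be sketched as follows. The background distribution used in the paper is not given in the slides, so a uniform background stands in as a clearly labelled placeholder; the function name, the seed and the sample size of 100 are likewise hypothetical:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simulation_distribution(effective, background):
    """Fix the effective probabilities and share the rest by background."""
    leftover = 1.0 - sum(effective.values())
    q_rest = sum(q for r, q in background.items() if r not in effective)
    probs = dict(effective)
    for r, q in background.items():
        if r not in effective:
            probs[r] = leftover * q / q_rest
    return probs


# Placeholder background (uniform); the real background is not in the slides.
background = {r: 1 / len(AMINO_ACIDS) for r in AMINO_ACIDS}
# A = alanine, G = glycine, M = methionine, as on the slide.
true_probs = simulation_distribution({"A": 0.35, "G": 0.50, "M": 0.10}, background)

# Draw one simulated alignment column of 100 residues.
rng = random.Random(0)
column = rng.choices(list(true_probs), weights=list(true_probs.values()), k=100)
print(len(column), round(sum(true_probs.values()), 6))
```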

4.1 False Effective and Ineffective Residues

Among the effective residues, alanine and glycine were correctly classified through all simulations.

The proportion of misclassified methionine residues increased from 0.8% to 2.9% as the threshold value was increased from 1 to 2.

As the threshold value increases, the classification of effective residues whose probabilities are relatively low might fail.

This problem, however, disappears as the number of estimated sequences increases.

When the threshold value was set to 1, cysteine, histidine, and tryptophan were misclassified in 5.3%, 2.6% and 4.1% of the simulations, respectively.

The variations seem to be closely related to the background distribution.

Residues with low background probabilities tend to be more often misclassified than the others.

4.2 Accuracy and variance of estimates

Estimates of effective residues were rather accurate.

There was no great difference between ML and EEP as seen from the figure.

For the ineffective residues, as can be seen from the figure, there is a great difference between the ML and EEP estimates.

The variance of the EEP estimates was clearly much lower than that of the ML estimates.
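
The variance reduction can be illustrated with a small simulation on a toy four-letter alphabet. This is not the paper's experiment. It assumes the effective set is known in advance, and it uses a simplified EEP-style estimate for an ineffective residue: the observed share of all ineffective residues, split in proportion to the background. All names and numbers are hypothetical:

```python
import random
import statistics


def compare_variances(true_probs, background, effective, residue,
                      n_sequences=50, n_repeats=2000, seed=0):
    """Return (variance of ML estimate, variance of EEP-style estimate)."""
    rng = random.Random(seed)
    residues = list(true_probs)
    weights = [true_probs[r] for r in residues]
    q_ineff = sum(background[r] for r in residues if r not in effective)
    ml, eep = [], []
    for _ in range(n_repeats):
        sample = rng.choices(residues, weights=weights, k=n_sequences)
        # ML: relative frequency of this one residue.
        ml.append(sample.count(residue) / n_sequences)
        # EEP-style: pool all ineffective counts, then split by background.
        n_ineff = sum(1 for s in sample if s not in effective)
        eep.append(n_ineff / n_sequences * background[residue] / q_ineff)
    return statistics.pvariance(ml), statistics.pvariance(eep)


true_probs = {"A": 0.5, "G": 0.3, "C": 0.1, "T": 0.1}
background = {r: 0.25 for r in true_probs}
var_ml, var_eep = compare_variances(true_probs, background, {"A", "G"}, "C")
print(var_ml, var_eep)
```

Pooling the ineffective counts means the estimate for C borrows information from T, which is why its variance comes out smaller than that of the per-residue ML estimate in this setting.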

4.3 Choosing the threshold value

To examine the effect of the threshold value on estimation, the data were estimated using incorrect threshold values.

For threshold values less than the true threshold, sensitivity improved and specificity worsened.

The opposite was true for threshold values greater than the true threshold.

As far as accuracy was concerned, sensitivity seemed more important than specificity.
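
The slides do not define sensitivity and specificity. Assuming the standard definitions applied to the classification of residues as effective (positive) or ineffective (negative), they could be computed as in this sketch; the function name and the example sets are hypothetical:

```python
def sensitivity_specificity(true_effective, predicted_effective, all_residues):
    """Standard sensitivity and specificity for effective/ineffective labels.

    Assumes at least one truly effective and one truly ineffective residue.
    """
    true_eff = set(true_effective)
    pred_eff = set(predicted_effective)
    everything = set(all_residues)
    tp = len(true_eff & pred_eff)               # effective, found
    fn = len(true_eff - pred_eff)               # effective, missed
    fp = len(pred_eff - true_eff)               # ineffective, wrongly flagged
    tn = len(everything - true_eff - pred_eff)  # ineffective, correctly left out
    return tp / (tp + fn), tn / (tn + fp)


residues = "ACDEFGHIKLMNPQRSTVWY"
# Methionine (M) missed, cysteine (C) wrongly flagged as effective.
sens, spec = sensitivity_specificity({"A", "G", "M"}, {"A", "G", "C"}, residues)
print(sens, spec)
```

A lower threshold flags more residues as effective, which raises sensitivity at the cost of specificity, matching the trade-off described above.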

5. Conclusion

The major advantage of the EEP method is the decrease in the dimension of the parameter space. In protein sequence alignments, the decrease is significant because in conserved positions only a few residues can be considered effective.

The study with 20 well-defined protein families indicates that the EEP method is able to detect sequences on average with 98% sensitivity and 99% specificity.

As a consequence of the reduction of the parameter space, the variance of the estimates for the ineffective residues decreases without affecting the variance of the estimates for the effective residues.

The major disadvantage of the method is its inability to take into account the physical and chemical characteristics of the amino acids; it thus ignores the relationships among the amino acids.