HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Page 1: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

HMMs for alignments & Sequence pattern discovery

I519 Introduction to Bioinformatics

Page 2: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Contents

Motifs
– We have seen motifs as regular expressions
– Profiles & consensus

Motif search
– Sequence motifs represent critical positions that are conserved in evolution, so search algorithms employing motifs may be used to identify more divergent sequences than methods based on global sequence similarity

PSI-BLAST (similarity search using a PSSM, Position-Specific Scoring Matrix)

HMM of protein family (a very brief introduction)

Page 3: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Motifs: Profiles and Consensus

Alignment    a G g t a c T t
             C c A t a c g t
             a c g t T A g t
             a c g t C c A t
             C c g t a c g G

Profile    A 3 0 1 0 3 1 1 0
           C 2 4 0 0 1 4 0 0
           G 0 1 4 0 0 0 3 1
           T 0 0 0 5 1 0 1 4

Consensus    A C G T A C G T

Line up the patterns by their start indexes

s = (s1, s2, …, st)

Construct matrix profile with frequencies of each nucleotide in columns

Consensus nucleotide in each position has the highest score in column
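The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names `build_profile` and `consensus` are made up here, and ties between equally frequent nucleotides are broken in A, C, G, T order, which is an arbitrary choice.

```python
ALPHABET = "ACGT"

def build_profile(motifs):
    """Count each nucleotide in every column of equal-length aligned motifs."""
    length = len(motifs[0])
    assert all(len(m) == length for m in motifs), "motifs must have equal length"
    profile = {base: [0] * length for base in ALPHABET}
    for motif in motifs:
        for col, base in enumerate(motif.upper()):
            profile[base][col] += 1
    return profile

def consensus(profile):
    """Pick the highest-count nucleotide per column (ties: first in ALPHABET)."""
    length = len(profile[ALPHABET[0]])
    return "".join(
        max(ALPHABET, key=lambda base: profile[base][col]) for col in range(length)
    )

# The five aligned patterns from the slide (case is ignored).
motifs = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
profile = build_profile(motifs)
for base in ALPHABET:
    print(base, *profile[base])
print("Consensus", consensus(profile))
```

Run on the slide's alignment, the counts reproduce the profile rows shown above.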

Page 4: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Profile Representation of Protein Families

Aligned DNA sequences can be represented by a 4·n profile matrix reflecting the frequencies of nucleotides in every aligned position.

A protein family can be represented by a 20·n profile representing frequencies of amino acids.

Page 5: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Profiles and HMMs

HMMs can also be used for aligning a sequence against a profile representing a protein family.

A 20·n profile $P$ corresponds to $n$ sequentially linked match states $M_1, \dots, M_n$ in the profile HMM of $P$.

Page 6: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Multiple Alignments and Protein Family Classification

Multiple alignment of a protein family shows variations in conservation along the length of a protein

Example: after aligning many globin proteins, biologists recognized that the helical regions of globins are more conserved than the other regions.

Page 7: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

What are Profile HMMs?

A profile HMM is a probabilistic representation of a multiple alignment.

A given multiple alignment (of a protein family) is used to build a profile HMM.

This model may then be used to find and score less obvious potential matches of new protein sequences.

Page 8: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Profile HMM

Figure: a profile HMM

Page 9: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Building a Profile HMM

Multiple alignment is used to construct the HMM model.

Assign each column to a Match state in the HMM. Add Insertion and Deletion states.

Estimate the emission probabilities according to amino acid counts in each column. Different positions in the protein will have different emission probabilities.

Estimate the transition probabilities between Match, Deletion and Insertion states.

The HMM model gets trained to derive the optimal parameters.
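The counting part of this recipe can be sketched as follows. Several choices here are assumptions rather than anything stated on the slide: a column becomes a match column when fewer than half of its entries are gaps, emissions get an add-one pseudocount, transitions are plain normalized counts without pseudocounts, and the begin and end states are labelled `("B", 0)` and `("E", n + 1)`. The final training step (iterative re-estimation of the parameters) is not shown.

```python
from collections import defaultdict

def build_profile_hmm(alignment, alphabet, pseudocount=1.0):
    """Estimate match emissions and transitions from a gapped alignment ('-' = gap)."""
    n_seqs, n_cols = len(alignment), len(alignment[0])
    # Assumed heuristic: match column if fewer than half of the rows have a gap.
    match_cols = [c for c in range(n_cols)
                  if sum(row[c] == "-" for row in alignment) < n_seqs / 2]
    col_to_state = {c: j + 1 for j, c in enumerate(match_cols)}  # 1-based states

    # Emission probabilities for each match state, with pseudocounts.
    emissions = []
    for c in match_cols:
        counts = {a: pseudocount for a in alphabet}
        for row in alignment:
            if row[c] != "-":
                counts[row[c]] += 1
        total = sum(counts.values())
        emissions.append({a: counts[a] / total for a in alphabet})

    # Transition counts: trace the state path of every aligned sequence.
    trans_counts = defaultdict(float)
    for row in alignment:
        prev, j = ("B", 0), 0  # j = index of the last match column passed
        for c in range(n_cols):
            if c in col_to_state:
                j = col_to_state[c]
                state = ("M", j) if row[c] != "-" else ("D", j)
            elif row[c] != "-":
                state = ("I", j)      # residue in a non-match column
            else:
                continue              # gap in a non-match column: no state
            trans_counts[(prev, state)] += 1
            prev = state
        trans_counts[(prev, ("E", len(match_cols) + 1))] += 1

    totals = defaultdict(float)
    for (src, _), count in trans_counts.items():
        totals[src] += count
    transitions = {(src, dst): count / totals[src]
                   for (src, dst), count in trans_counts.items()}
    return match_cols, emissions, transitions

# Toy DNA alignment (hypothetical data); column 2 is treated as an insertion.
toy = ["AC-GT", "AC-GT", "A-TGT", "ACAGT"]
match_cols, emissions, transitions = build_profile_hmm(toy, "ACGT")
print(match_cols)
print(transitions[(("M", 1), ("M", 2))], transitions[(("M", 1), ("D", 2))])
```

For a protein family the same function applies with the 20-letter amino acid alphabet passed as `alphabet`.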

Page 10: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

States of Profile HMM

Match states $M_1 \dots M_n$ (plus begin/end states)

Insertion states $I_0, I_1, \dots, I_n$

Deletion states $D_1 \dots D_n$

Page 11: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Transition Probabilities in Profile HMM

$\log(a_{MI}) + \log(a_{IM})$ = gap initiation penalty

$\log(a_{II})$ = gap extension penalty
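To see how these two quantities act like an affine gap penalty, consider a run of k inserted symbols: the path enters the insertion state once, stays in it k - 1 times, and leaves once. The sketch below computes that log score; the function name `gap_log_score` and the probability values are hypothetical, chosen only for illustration.

```python
import math

def gap_log_score(gap_length, a_MI, a_II, a_IM):
    """Log-probability of the transitions used by an insertion run of given length.

    One M->I and one I->M transition (gap initiation) plus
    gap_length - 1 self-transitions I->I (gap extension).
    """
    if gap_length < 1:
        raise ValueError("gap_length must be at least 1")
    return math.log(a_MI) + (gap_length - 1) * math.log(a_II) + math.log(a_IM)

# Hypothetical transition probabilities, not taken from any real model.
a_MI, a_II, a_IM = 0.05, 0.4, 0.6
for k in (1, 2, 3):
    print(k, round(gap_log_score(k, a_MI, a_II, a_IM), 3))
```

Each extra inserted symbol changes the score by exactly log(a_II), which is why that term plays the role of the extension penalty.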

Page 12: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Emission Probabilities in Profile HMM

• Probability of emitting a symbol $a$ at an insertion state $I_j$:

$e_{I_j}(a) = p(a)$

where $p(a)$ is the frequency of occurrence of the symbol $a$ in all the sequences.
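A small sketch of this definition, assuming p(a) is estimated by simple counting over a set of sequences (the toy sequences and the name `background_frequencies` are made up for illustration). It also shows one consequence: when emissions are scored as log-odds against p(a), an insertion state contributes zero.

```python
from collections import Counter
import math

def background_frequencies(sequences):
    """Frequency of each symbol over all sequences pooled together."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}

# Toy sequences (hypothetical data).
sequences = ["ACGT", "AACG", "ACCA"]
p = background_frequencies(sequences)

# Insertion states emit according to the background distribution.
insert_emission = dict(p)

# Log-odds of an insertion emission against the background: log(p(a)/p(a)).
insert_log_odds = {a: math.log(insert_emission[a] / p[a]) for a in p}
print(p)
print(insert_log_odds)
```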

Page 13: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Profile HMM Alignment

Define $v^M_j(i)$ as the logarithmic likelihood score of the best path for matching $x_1 \dots x_i$ to the profile HMM, ending with $x_i$ emitted by the state $M_j$.

$v^I_j(i)$ and $v^D_j(i)$ are defined similarly.

Page 14: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Profile HMM Alignment: Dynamic Programming

$$
v^M_j(i) = \log\frac{e_{M_j}(x_i)}{p(x_i)} + \max
\begin{cases}
v^M_{j-1}(i-1) + \log(a_{M_{j-1},M_j}) \\
v^I_{j-1}(i-1) + \log(a_{I_{j-1},M_j}) \\
v^D_{j-1}(i-1) + \log(a_{D_{j-1},M_j})
\end{cases}
$$

$$
v^I_j(i) = \log\frac{e_{I_j}(x_i)}{p(x_i)} + \max
\begin{cases}
v^M_j(i-1) + \log(a_{M_j,I_j}) \\
v^I_j(i-1) + \log(a_{I_j,I_j}) \\
v^D_j(i-1) + \log(a_{D_j,I_j})
\end{cases}
$$
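The recurrences can be turned into a small dynamic program. The sketch below makes several assumptions that are not on the slide: the begin and end states are treated as M_0 and M_{n+1}, insertion emissions equal the background so their log-odds term is 0, and the deletion-state recurrence (not shown above) is filled in by analogy, taking the best of the three predecessors at position j - 1 without consuming a symbol. All names and the toy parameters are hypothetical.

```python
import math

NEG_INF = float("-inf")

def viterbi_profile(x, n, log_a, match_log_odds):
    """Best log-odds score for aligning x to a profile HMM with n match states.

    log_a maps ((kind, j), (kind, j')) to a log transition probability;
    missing entries are treated as impossible transitions.
    match_log_odds[j][sym] = log(e_Mj(sym) / p(sym)) for j = 1..n.
    """
    L = len(x)
    t = lambda src, dst: log_a.get((src, dst), NEG_INF)
    vM = [[NEG_INF] * (L + 1) for _ in range(n + 1)]
    vI = [[NEG_INF] * (L + 1) for _ in range(n + 1)]
    vD = [[NEG_INF] * (L + 1) for _ in range(n + 1)]
    vM[0][0] = 0.0  # begin state, treated as M_0

    for i in range(L + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:  # x_i emitted by M_j
                vM[j][i] = match_log_odds[j][x[i - 1]] + max(
                    vM[j - 1][i - 1] + t(("M", j - 1), ("M", j)),
                    vI[j - 1][i - 1] + t(("I", j - 1), ("M", j)),
                    vD[j - 1][i - 1] + t(("D", j - 1), ("M", j)))
            if i > 0:  # x_i emitted by I_j (log-odds term assumed 0)
                vI[j][i] = max(
                    vM[j][i - 1] + t(("M", j), ("I", j)),
                    vI[j][i - 1] + t(("I", j), ("I", j)),
                    vD[j][i - 1] + t(("D", j), ("I", j)))
            if j > 0:  # D_j emits nothing (recurrence assumed by analogy)
                vD[j][i] = max(
                    vM[j - 1][i] + t(("M", j - 1), ("D", j)),
                    vI[j - 1][i] + t(("I", j - 1), ("D", j)),
                    vD[j - 1][i] + t(("D", j - 1), ("D", j)))
    end = ("M", n + 1)
    return max(vM[n][L] + t(("M", n), end),
               vI[n][L] + t(("I", n), end),
               vD[n][L] + t(("D", n), end))

def toy_log_transitions(n):
    """Hypothetical table: same probabilities at every position (not renormalized at j = n)."""
    probs = {"M": (0.90, 0.05, 0.05), "I": (0.50, 0.40, 0.10), "D": (0.50, 0.10, 0.40)}
    log_a = {}
    for j in range(n + 1):
        for kind, (to_m, to_i, to_d) in probs.items():
            log_a[((kind, j), ("M", j + 1))] = math.log(to_m)
            log_a[((kind, j), ("I", j))] = math.log(to_i)
            if j < n:
                log_a[((kind, j), ("D", j + 1))] = math.log(to_d)
    return log_a

def log_odds(emit, background=0.25):
    return {a: math.log(pr / background) for a, pr in emit.items()}

# Two match states: the first prefers A, the second prefers C.
match_log_odds = [None,
                  log_odds({"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}),
                  log_odds({"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1})]
log_a = toy_log_transitions(2)
score_ac = viterbi_profile("AC", 2, log_a, match_log_odds)
score_gg = viterbi_profile("GG", 2, log_a, match_log_odds)
print(score_ac, score_gg)
```

In this toy model the sequence "AC" follows the match states directly and receives a positive log-odds score, while "GG" scores below zero. The function returns only the score; recovering the alignment itself would additionally require storing back-pointers.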

Page 15: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Paths in Edit Graph and Profile HMM

A path through an edit graph and the corresponding path through a profile HMM

Page 16: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Making a Collection of HMMs for Protein Families

Use BLAST to separate a protein database into families of related proteins.

Construct a multiple alignment for each protein family.

Construct a profile HMM model and optimize the parameters of the model (transition and emission probabilities).

Align the target sequence against each HMM to find the best fit between a target sequence and an HMM

Page 17: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Application of Profile HMM to Modeling Globin Proteins

Globins represent a large collection of protein sequences

400 globin sequences were randomly selected from all globins and used to construct a multiple alignment.

Multiple alignment was used to assign an initial HMM

This model was then trained repeatedly, with model lengths chosen randomly between 145 and 170, to obtain an HMM with optimized probabilities.

Page 18: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

HMMER package

Tools for making HMMs and for hmmscan

HMMER3 (as fast as BLAST)

Page 19: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Sequence Pattern (Motif) Discovery

Finding patterns in multiple alignments, or in unaligned sequences

eMotif (a protein pattern database); eBLOCKs

Gibbs and MEME

– To infer patterns in unaligned sequences

– The Gibbs program starts with a fixed pattern length W and a random set of locations of the pattern in the given input sequences (i.e., the initial pattern is random); then one sequence at a time is selected at random and an attempt is made to improve its pattern position.

– MEME uses many similar concepts, but uses the EM (expectation maximization) method.
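The Gibbs procedure described above can be sketched as follows. This is a simplified illustration, not the Gibbs program itself: it assumes a DNA alphabet, add-one pseudocounts, no background model, and a fixed number of iterations; the function name and the toy sequences are hypothetical.

```python
import random

def gibbs_motif_search(seqs, W, n_iter=500, seed=0, alphabet="ACGT", pseudocount=1.0):
    """Return one start position per sequence for a pattern of fixed length W."""
    rng = random.Random(seed)
    # Random initial pattern locations.
    starts = [rng.randrange(len(s) - W + 1) for s in seqs]
    for _ in range(n_iter):
        k = rng.randrange(len(seqs))  # sequence whose position is re-sampled
        # Column counts from the current pattern in all other sequences.
        counts = [{a: pseudocount for a in alphabet} for _ in range(W)]
        for idx, (s, st) in enumerate(zip(seqs, starts)):
            if idx == k:
                continue
            for col in range(W):
                counts[col][s[st + col]] += 1
        total = (len(seqs) - 1) + pseudocount * len(alphabet)
        # Probability of every window of the held-out sequence under the profile.
        weights = []
        for st in range(len(seqs[k]) - W + 1):
            w = 1.0
            for col in range(W):
                w *= counts[col][seqs[k][st + col]] / total
            weights.append(w)
        # Sample the new position in proportion to those probabilities.
        starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy input (hypothetical data); the sampler is stochastic, so results depend on the seed.
seqs = ["TTACGTACGG", "GGGACGTACT", "CACGTACAAA", "TTTTACGTAC"]
starts = gibbs_motif_search(seqs, W=6, n_iter=200, seed=1)
for s, st in zip(seqs, starts):
    print(st, s[st:st + 6])
```

Sampling the new position, rather than always taking the best-scoring window, is what lets the procedure escape poor starting patterns; a practical implementation would typically also run several random restarts and keep the best-scoring result.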

Page 20: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Utilization of Multiple Alignments

Residue conservation
– Jalview

Subfamilies
– SCI-PHY
– FunShift

Page 21: HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Readings

Chapter 6