10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this...

Post on 01-Jan-2016


04/20/23 1

Multiple sequence alignment


Copyright notice

• Many of the images in this power point presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

• Many slides in this PowerPoint presentation are from slides by Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!


Multiple sequence alignment: definition

• a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned

• Homologous residues are aligned in columns across the length of the sequences

• residues are homologous in an evolutionary sense

• residues are homologous in a structural sense


Multiple sequence alignment: properties

• not necessarily one “correct” alignment of a protein family

• protein sequences evolve...

• ...the corresponding three-dimensional structures of proteins also evolve

• may be impossible to identify amino acid residues that align properly (structurally) throughout a multiple sequence alignment

• for two proteins sharing 30% amino acid identity, about 50% of the individual amino acids are superposable in the two structures


Multiple sequence alignment: features

• some aligned residues, such as Cysteines that form disulfide bridges, may be highly conserved

• there may be conserved motifs such as a transmembrane domain

• there may be conserved secondary structure features

• there may be regions with consistent patterns of insertions or deletions (indels)


Multiple sequence alignment: uses

• MSA is more sensitive than pairwise alignment to detect homologs

• BLAST output can take the form of a MSA, and can reveal conserved residues or motifs

• Population data can be analyzed in a MSA (PopSet)

• A single query can be searched against a database of MSAs

• Regulatory regions of genes may have consensus sequences identifiable by MSA


Multiple Sequence Alignment: Approaches

• Optimal Global Alignments - Dynamic programming

• Global Progressive Alignments - Match closely-related sequences first using a guide tree (Feng & Doolittle)

• Global Iterative Alignments - Multiple re-building attempts to find best alignment

• Local alignments - Profiles, Blocks, Patterns


Dynamic Programming

• Generalization of Needleman-Wunsch
– Find alignment that maximizes a score function

• Computationally expensive: time grows as the product of the sequence lengths
– 2 sequences: O(n^2)
– 3 sequences: O(n^3)
– 4 sequences: O(n^4)
– N sequences: O(n^N)

• Can align about 7 relatively short (200-300) protein sequences in a reasonable amount of time; not much beyond that
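To make the growth rate concrete, here is a minimal sketch that counts the cells of the dynamic-programming lattice. The helper names `dp_cells` and `predecessors_per_cell` are illustrative only, and the sequence length of 250 is an assumed midpoint of the 200-300 range quoted above.

```python
def dp_cells(lengths):
    """Number of cells in the DP lattice for sequences of the given lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1  # one extra position per sequence for the leading gap
    return cells


def predecessors_per_cell(num_sequences):
    """Each cell is reached from 2**N - 1 neighbours: every non-empty
    subset of the N sequences may advance by one residue."""
    return 2 ** num_sequences - 1


if __name__ == "__main__":
    for n_seqs in (2, 3, 4, 7):
        lengths = [250] * n_seqs  # assumed length, for illustration only
        print(n_seqs, dp_cells(lengths), predecessors_per_cell(n_seqs))
```

The cell count is multiplied by roughly the sequence length for every sequence added, which is why exact dynamic programming stops being practical after a handful of sequences.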


Progressive Alignment

• Find succession of pairwise alignments

• Heuristic – cannot separate scoring and optimization

• Works well for closely related sequences

• Very sensitive to initial alignments


Progressive Alignment

• Use a series of pairwise alignments to align larger and larger groups of sequences, following the branching order in the guide tree
– Align the most closely related sequences first, then add the next most closely related sequence, iteratively
– Full DP algorithm is used when aligning two existing alignments or sequences
– Gaps in present/older alignments remain fixed

Progressive Alignment Examples

• Feng-Doolittle (1987)

• ClustalW

• T-coffee


Feng-Doolittle MSA occurs in 3 stages

• [1] Do a set of global pairwise alignments (Needleman and Wunsch)

• [2] Create a guide tree

• [3] Progressively align the sequences

Progressive MSA stage 1 of 3: generate global pairwise alignments

five distantly related lipocalins

best score

Progressive MSA stage 1 of 3: generate global pairwise alignments

Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96

five closely related lipocalins

best score


Number of pairwise alignments needed

For N sequences, (N-1)(N)/2

For 5 sequences, (4)(5)/2 = 10
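As a quick check of the formula, the pairs can simply be enumerated. This is only an illustrative sketch; the function name `pairwise_jobs` is made up for the example.

```python
from itertools import combinations


def pairwise_jobs(n):
    """All unordered pairs (i, j) with i < j among sequences numbered 1..n."""
    return list(combinations(range(1, n + 1), 2))


if __name__ == "__main__":
    pairs = pairwise_jobs(5)
    print(len(pairs))  # number of pairwise alignments for 5 sequences
    print(pairs)
```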


Feng-Doolittle stage 2: guide tree

• Convert similarity scores to distance scores

• A tree shows the distance between objects

• ClustalW provides a syntax to describe the tree

• A guide tree is not a phylogenetic tree
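A minimal sketch of the similarity-to-distance conversion, using the ten pairwise scores reported earlier for the five closely related lipocalins. It assumes the scores are percent identities and uses distance = 1 - score/100; that conversion is an assumption made for illustration, not necessarily the one Feng and Doolittle used.

```python
# Pairwise scores from the stage 1 output (closely related lipocalins).
scores = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92, (2, 3): 99,
    (2, 4): 86, (2, 5): 85, (3, 4): 85, (3, 5): 84, (4, 5): 96,
}


def to_distance(score_percent):
    """Assumed conversion: treat the score as percent identity."""
    return 1.0 - score_percent / 100.0


distances = {pair: to_distance(s) for pair, s in scores.items()}
closest = min(distances, key=distances.get)

if __name__ == "__main__":
    print("closest pair:", closest, "distance:", round(distances[closest], 2))
```

The pair with the best score becomes the pair with the smallest distance, which is the pair a guide tree would join first.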


Guide Tree

• UPGMA – Unweighted Pair Group Method by Arithmetic Mean
– Simplest method of tree construction
– Assumes equal rates of mutation along the branches

• UPGMA Algorithm
– Definition: Node in a tree is called an Operational Taxonomic Unit (OTU)
– From distance matrix, cluster pair of OTUs with smallest distance, and calculate new distance
– Repeat previous step until clusters converge


Guide Tree - UPGMA

• Cluster pair with smallest distance

• Recalculate distance matrix

    A  B  C  D  E
B   2
C   4  4
D   6  6  6
E   6  6  6  4
F   8  8  8  8  8


Guide Tree - UPGMA

• Calculate new distance using composite OTU (A,B):
– Distance between a simple OTU and a composite OTU is the average of the distances between the simple OTU and the constituent simple OTUs of the composite OTU

dist (A,B),C = (dist A,C + dist B,C) / 2 = (4 + 4) / 2 = 4
dist (A,B),D = (dist A,D + dist B,D) / 2 = (6 + 6) / 2 = 6
dist (A,B),E = (dist A,E + dist B,E) / 2 = (6 + 6) / 2 = 6
dist (A,B),F = (dist A,F + dist B,F) / 2 = (8 + 8) / 2 = 8


Guide Tree - UPGMA

• Calculate new distance using composite OTU (A,B):
– Distance between a simple OTU and a composite OTU is the average of the distances between the simple OTU and the constituent simple OTUs of the composite OTU

    A,B  C  D  E
C    4
D    6   6
E    6   6  4
F    8   8  8  8


Guide Tree - UPGMA

• Second Iteration

    A,B  C  D  E
C    4
D    6   6
E    6   6  4
F    8   8  8  8


Guide Tree - UPGMA

• Third Iteration

     A,B  C  D,E
C     4
D,E   6   6
F     8   8   8


Guide Tree - UPGMA

• Fourth Iteration

     AB,C  D,E
D,E   6
F     8     8


Guide Tree - UPGMA

• Fifth Iteration

     ABC,DE
F     8
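The five iterations above can be reproduced with a short sketch. It keeps each cluster as a tuple of leaf names and uses the average over all leaf pairs as the cluster distance, which gives the same numbers as the slides for this matrix. The function name `upgma` and the data layout are illustrative assumptions. Ties (for example C with A,B versus D with E, both at distance 4) are broken by iteration order, so the merge order can differ from the slides while the merge distances stay the same.

```python
from itertools import combinations

labels = ["A", "B", "C", "D", "E", "F"]
# Lower-triangular distance matrix from the slides, one row per OTU.
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6],
        "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {frozenset((r, labels[j])): v
        for r, vals in rows.items() for j, v in enumerate(vals)}


def upgma(labels, dist):
    """Return the list of merges as (cluster1, cluster2, distance)."""
    clusters = [(name,) for name in labels]

    def d(c1, c2):
        # Average distance over all pairs of leaves in the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: d(*p))
        merges.append((c1, c2, d(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


if __name__ == "__main__":
    for c1, c2, height in upgma(labels, dist):
        print(c1, "+", c2, "at distance", height)
```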


Guide Tree

• ClustalW uses Neighbor-Joining

• Neighbor Joining corrects the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch of a tree.

• Neighbor Joining has given the best results in simulation studies and it is the most computationally efficient of the distance algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989))

• Neighbor-Joining Algorithm
– Allows unequal rates of mutation along each branch
– Find pairs of OTUs that minimize total branch length at each stage of clustering, starting with a starlike tree (Minimum-Evolution Tree). The distance matrix is adjusted for differences in the rate of evolution of each taxon (branch).

NJ Algorithm

Neighbor Joining to Calculate the Guide Tree Phase:
– does not require a uniform molecular clock
– the raw data are provided as a distance matrix
– the initial tree is a star tree
– distance matrix is modified
• distance between node pairs is adjusted on the basis of their average divergence from all other nodes
– the least-distant pair of nodes are linked

NJ Algorithm

Neighbor Joining to Calculate the Guide Tree Phase:
– When two nodes are linked:
• Add their common ancestral node to the tree
• Delete the terminal nodes with their branches
• The common ancestor is now a terminal node on a smaller tree
– At each step, two terminal nodes are replaced by one new node
– The process is complete when there are only two nodes separated by a single branch

NJ Algorithm

• Advantages of Neighbor Joining
– Fast
• Can be used on large datasets
• Can support bootstrap analysis
– Can handle lineages with largely different branch lengths (different molecular evolutionary rates)
– Can be used with methods that use correction for multiple substitutions

NJ Algorithm

• Disadvantages of Neighbor Joining
– Sequence information is reduced
• Sequences are boiled down to distances
• No secondary or tertiary features used
– Gives only one possible tree
– Strongly dependent on the model of evolution used

NJ Algorithm

• NJ example from: http://www.icp.ucl.ac.be/~opperd/private/neighbor.html

• Consider the following tree:

• Notice that the branches for D and B are longer.

• This expresses the idea that they have a faster molecular clock than the other OTUs.

NJ Algorithm

The distance matrix for the tree is:

    A   B   C   D   E
B   5
C   4   7
D   7  10   7
E   6   9   6   5
F   8  11   8   9   8

Normally, we create the tree from the distances.

In this example, we use the tree to derive the distances.

NJ Algorithm

• We start with a star tree.

• Notice that we have 6 operational taxonomic units (OTUs)

• The star tree has a leaf for each OTU

[Figure: star tree with leaves A, B, C, D, E, F]

NJ Algorithm

Step 1: Calculate the net divergence for each OTU. The net divergence is the sum of distances from i to all other OTUs:

r(i) = Σ_{j≠i} d(ij)

    A   B   C   D   E
B   5
C   4   7
D   7  10   7
E   6   9   6   5
F   8  11   8   9   8

r(A) = 5+4+7+6+8 = 30
r(B) = 42
r(C) = 32
r(D) = 38
r(E) = 34
r(F) = 44

NJ Algorithm

Step 2: Calculate a new distance matrix based on average divergence:

M(ij) = d(ij) - [r(i) + r(j)]/(N-2)

Example: A,B

Recall: r(A) = 30, r(B) = 42, d(AB) = 5, N = 6

M(AB) = d(AB) - [r(A) + r(B)]/(N-2) = 5 - (30 + 42)/4 = -13

NJ Algorithm

Step 2: continued

M(ij) = d(ij) - [r(i) + r(j)]/(N-2)

Distance matrix:

    A   B   C   D   E
B   5
C   4   7
D   7  10   7
E   6   9   6   5
F   8  11   8   9   8

Average divergence matrix:

      A      B      C      D      E
B  -13.0
C  -11.5  -11.5
D  -10.0  -10.0  -10.5
E  -10.0  -10.0  -10.5  -13.0
F  -10.5  -10.5  -11.0  -11.5  -11.5

NJ Algorithm

Step 3: choose two OTUs for which M(ij) is the smallest.
– the possible choices are: A,B and D,E
– arbitrarily choose A and B
– form a new node called U, the parent of A & B
– calculate the branch length from U to A and B

S(AU) = d(AB)/2 + [r(A) - r(B)] / [2(N-2)] = 5/2 + (30 - 42)/8 = 1
S(BU) = d(AB) - S(AU) = 5 - 1 = 4

NJ Algorithm

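• The tree after U is added.

[Figure: A and B are joined at the new node U, with branch lengths U-A = 1 and U-B = 4; C, D, E and F are still attached as leaves]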

NJ Algorithm

Step 4: define distances from U to other terminal nodes:
– d(CU) = [d(AC) + d(BC) - d(AB)] / 2 = (4 + 7 - 5)/2 = 3
– d(DU) = [d(AD) + d(BD) - d(AB)] / 2 = (7 + 10 - 5)/2 = 6
– d(EU) = [d(AE) + d(BE) - d(AB)] / 2 = (6 + 9 - 5)/2 = 5
– d(FU) = [d(AF) + d(BF) - d(AB)] / 2 = (8 + 11 - 5)/2 = 7
– Note: no change in paired distances {C,D,E,F}

    U   C   D   E
C   3
D   6   7
E   5   6   5
F   7   8   9   8

NJ Algorithm

• Now N = N-1 = 5

• Repeat steps 1 through 4

• Stop when N = 2
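One pass through steps 1 to 4 can be sketched as follows, using the six-OTU distance matrix of the example. The function names (`net_divergence`, `m_matrix`, `join`) and the dictionary layout are illustrative assumptions, not part of any particular package.

```python
from itertools import combinations

labels = ["A", "B", "C", "D", "E", "F"]
rows = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
        "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
d = {frozenset((r, labels[j])): v
     for r, vals in rows.items() for j, v in enumerate(vals)}


def dist(i, j):
    return d[frozenset((i, j))]


def net_divergence(taxa):
    """Step 1: r(i) = sum of distances from i to all other OTUs."""
    return {i: sum(dist(i, j) for j in taxa if j != i) for i in taxa}


def m_matrix(taxa, r):
    """Step 2: M(ij) = d(ij) - [r(i) + r(j)] / (N - 2)."""
    n = len(taxa)
    return {(i, j): dist(i, j) - (r[i] + r[j]) / (n - 2)
            for i, j in combinations(taxa, 2)}


def join(taxa, i, j, r):
    """Steps 3 and 4: branch lengths to the new node U and the
    distances from U to every remaining OTU."""
    n = len(taxa)
    s_i = dist(i, j) / 2 + (r[i] - r[j]) / (2 * (n - 2))
    s_j = dist(i, j) - s_i
    to_u = {k: (dist(i, k) + dist(j, k) - dist(i, j)) / 2
            for k in taxa if k not in (i, j)}
    return s_i, s_j, to_u


if __name__ == "__main__":
    r = net_divergence(labels)
    m = m_matrix(labels, r)
    s_au, s_bu, to_u = join(labels, "A", "B", r)
    print(r)
    print(m[("A", "B")], m[("D", "E")])
    print(s_au, s_bu, to_u)
```

Repeating the same three calls on the reduced set of OTUs (with U in place of A and B) continues the algorithm until only two nodes remain.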

Progressive MSA stage 2 of 3: generate a guide tree calculated from the distance matrix


Progressive MSA stage 2 of 3: generate guide tree

((gi|5803139|ref|NP_006735.1|:0.04284,(gi|6174963|sp|Q00724|RETB_MOUS:0.00075,gi|132407|sp|P04916|RETB_RAT:0.00423):0.10542):0.01900,gi|89271|pir||A39486:0.01924,gi|132403|sp|P18902|RETB_BOVIN:0.01902);

five closely related lipocalins



Feng-Doolittle stage 3: progressive alignment

• Make a MSA based on the order in the guide tree

• Start with the two most closely related sequences

• Then add the next closest sequence

• Continue until all sequences are added to the MSA

• Rule: “once a gap, always a gap.”


Use Clustal W to do a progressive MSA

http://www2.ebi.ac.uk/clustalw/

Progressive MSA stage 3 of 3: progressively align the sequences following the branch order of the tree


Clustal W alignment of 5 closely related lipocalins

CLUSTAL W (1.82) multiple sequence alignment

gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN  ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|     MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS  MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                  ********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|     EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS  EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                                *********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                                ****************:*******:****:*:* ****** *********


Why “once a gap, always a gap”?

• There are many possible ways to make a MSA

• Where gaps are added is a critical question

• Gaps are often added to the first two (closest) sequences

• To change the initial gap choices later on would be to give more weight to distantly related sequences

• To maintain the initial gap choices is to trust that those gaps are most believable


Progressive Alignment: Discussion

• Strengths:
– Speed
– Progression biologically sensible (aligns using a tree)

• Weaknesses:
– No objective function
– No way of quantifying whether or not the alignment is good
– Local minimum problem
– Any errors in the initial alignment are carried through; no way to correct an early mistake
– More efficient for closely related sequences than for divergent sequences


Iterative Methods for Multiple Sequence Alignment

• Seeks to increase MSA score by randomly altering the alignment.

• Usually used to refine alignment

• Attempt to correct initial alignment problems by repeatedly aligning subgroups of the sequences and then by aligning these subgroups into a global alignment of all the sequences
– Starts with a multiple sequence alignment.
– Refine it.
– Repeat until one MSA doesn’t change significantly from the next.


MultAlign

• Pairwise scores recalculated during progressive alignment

• Tree is recalculated

• Alignment is refined


PRRP

• Initial pairwise alignment predicts tree

• Tree produces weights

• Locally aligned regions considered to produce new alignment and tree

• Continue until alignments converge


DIALIGN

• Pairs of sequences aligned to locate ungapped aligned regions

• Diagonals of various lengths identified

• Collection of weighted diagonals provide alignment


SAGA: Genetic Algorithms

• Generate many different MSAs by rearrangements simulating gaps and recombination events

• SAGA (Sequence Alignment by Genetic Algorithm) is one approach


Simulated Annealing

• Obtain a higher-scoring multiple alignment

• Rearranges current alignment using a probabilistic approach to identify changes that increase alignment score

• MSASA: Multiple Sequence Alignment by Simulated Annealing

MUSCLE: next-generation progressive MSA

[1] Build a draft progressive alignment
– Determine pairwise similarity through k-mer counting (not by alignment)
– Compute distance (triangular distance) matrix
– Construct tree using UPGMA
– Construct draft progressive alignment following tree
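The idea of estimating similarity from shared k-mers, without aligning, can be sketched as below. This is only an illustration: it uses the fraction of shared k-mers on the plain amino acid alphabet, and the function names are made up for the example. MUSCLE's actual k-mer distance is defined differently in detail, so treat the formula here as an assumption.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k=3):
    """Fraction of k-mers shared by the two sequences (multiset overlap),
    relative to the number of k-mers in the shorter sequence."""
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return shared / (min(len(a), len(b)) - k + 1)


def kmer_distance(a, b, k=3):
    return 1.0 - kmer_similarity(a, b, k)


if __name__ == "__main__":
    # Short lipocalin motif fragments, k = 2 because they are only 5 long.
    print(kmer_distance("GTWYA", "GLWYA", k=2))
    print(kmer_distance("GTWYA", "GEWFS", k=2))
```

Because no alignment is computed, all pairwise distances can be filled in quickly before the UPGMA tree is built.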

MUSCLE: next-generation progressive MSA

[2] Improve the progressive alignment
– Compute pairwise identity through current MSA
– Construct new tree with Kimura distance measures
– Compare new and old trees: if improved, repeat this step; if not improved, then we’re done

MUSCLE: next-generation progressive MSA

[3] Refinement of the MSA
– Split tree in half by deleting one edge
– Make profiles of each half of the tree
– Re-align the profiles
– Accept/reject the new alignment

MUSCLE output (formatted with SeaView)

SeaView is a graphical multiple sequence alignment editor available at http://pbil.univ-lyon1.fr/software/seaview.html


Scoring Multiple Alignments

• Because we can’t see the ancestral sequences, it is often impossible to ever know what is the “correct” multiple alignment. (Since some residues may not be structurally superposable, there may not be a correct alignment.)

• The best we can do is to define a “scoring function” for evaluating the “goodness” of a multiple alignment.

• We then try to find the multiple alignment that maximizes this function.

• This is entirely analogous to the scoring function used in pairwise alignment algorithms.


Scoring Function Features

• The key difference between multiple alignments and pairwise alignments is the fact that different pairs of sequences are separated by different evolutionary distances.

• Any set of sequences we wish to align is related by a phylogenetic tree.

• Ideally, our scoring system should model molecular sequence evolution.


Ideal Scoring Function

• Sequences are related by an evolutionary tree.

• Assume a probabilistic model of molecular evolution.

• Multiple alignment score, S, is

S = ΣX Pr(Tree|Root=X) Pr(X)

D

A B

C

Root


Ideal Function is Too Complex

• In most cases, we don’t have nearly enough information to model evolution accurately enough.

• The probability depends on knowing the length of each branch in the tree accurately.

• Evolution is not constant at each column in the alignment since selective pressure is stronger on critical residues.


Scoring Function Features cont’d

• As with pairwise alignments, the scoring function should take the chemical/physical properties of residues into account.


Simple Score Functions

• If we assume that the columns of the alignment are independent, the scoring function can be written as a sum of column scores plus a gap score:

S(m) = G + Σi S(mi)

where mi is column i of the alignment and G is a function for scoring gaps.


Sum of Pairs: SP Scores

• Using BLOSUM62 matrix, gap penalty -8

• In column 1, we have the pairs (-,S), (-,S), (S,S)

• k(k-1)/2 pairs per column

- I K
S I K
S S E

Column 1 score: -8 - 8 + 4 = -12
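The sum-of-pairs calculation for the three-sequence example can be sketched as follows. Only the BLOSUM62 entries needed for this alignment are included, and a gap paired with a gap is scored 0, which is an assumption since the slide does not cover that case. The column 1 value is the one worked out above; the other columns follow from the same rule.

```python
from itertools import combinations

# Subset of BLOSUM62, keyed by the alphabetically sorted residue pair.
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("I", "I"): 4, ("I", "S"): -2,
    ("K", "K"): 5, ("E", "K"): 1, ("E", "E"): 5,
}
GAP_PENALTY = -8


def pair_score(a, b):
    if a == "-" and b == "-":
        return 0  # assumption: gap-gap pairs contribute nothing
    if a == "-" or b == "-":
        return GAP_PENALTY
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def column_score(column):
    """Sum the scores of all k(k-1)/2 residue pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(rows):
    """Sum of column scores over the whole alignment."""
    return sum(column_score(col) for col in zip(*rows))


alignment = ["-IK", "SIK", "SSE"]

if __name__ == "__main__":
    for index, col in enumerate(zip(*alignment), start=1):
        print("column", index, column_score(col))
    print("SP score", sp_score(alignment))
```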


Problems with Sum of Pairs Scores

SP scores are very commonly used, but they have problems:

• They have no probabilistic justification.

• The relative difference in score between the correct and incorrect alignment decreases as the evidence increases—this is counter-intuitive.


Minimum Entropy Scores

• This is a probabilistic (well, information theoretic) way of saying how “pure” or “good” an alignment column is.

• Intuition: good alignment columns will contain very few different letters

• Method: We convert the alignment column into a probability vector and compute the entropy of the vector.


Entropy

• Entropy is a very useful concept from Information Theory.

• If X is a random variable that can have values X1,X2,…,Xk, the entropy of X is defined as:

H(X) = −Σj Pr(Xj) log Pr(Xj)

• The maximum entropy is log k, when the distribution is uniform, e.g., Pr(X) = (¼, ¼, ¼, ¼).

• The minimum entropy is 0, when the distribution puts all its weight on one letter, e.g., Pr(X) = (0,0,1,0).

Entropy

• Define frequencies for the occurrence of each letter in each column of the multiple alignment
– pA = 1, pT = pG = pC = 0 (1st column)
– pA = 0.75, pT = 0.25, pG = pC = 0 (2nd column)
– pA = 0.50, pT = 0.25, pC = 0.25, pG = 0 (3rd column)

• Compute entropy of each column: −Σ_{X=A,T,G,C} pX log pX

AAA
AAA
AAT
ATC

Entropy: Example

Best case: column A, A, A, A; entropy = 0

Worst case: column A, T, G, C; entropy = −Σ (1/4) log(1/4) = −4 × (1/4 × (−2)) = 2

Multiple Alignment: Entropy Score

Entropy for a multiple alignment is the sum of entropies of its columns:

Σ_{over all columns} ( −Σ_{X=A,T,G,C} pX log pX )

Entropy of an Alignment: Example

column entropy: -( pAlogpA + pClogpC + pGlogpG + pTlogpT)

•Column 1 = -[1*log(1) + 0*log0 + 0*log0 +0*log0] = 0

•Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log0 + 0*log0] = -[ (1/4)*(-2) + (3/4)*(-.415) ] = +0.811

•Column 3 = -[(1/4)*log(1/4)+(1/4)*log(1/4)+(1/4)*log(1/4) +(1/4)*log(1/4)] = 4* -[(1/4)*(-2)] = +2.0

•Alignment Entropy = 0 + 0.811 + 2.0 = +2.811

A A A

A C C

A C G

A C T
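The column and alignment entropies of this example can be checked with a few lines. This is a minimal sketch using base-2 logarithms, as in the worked numbers above; the function names are illustrative.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Entropy (in bits) of the letter frequencies in one column.
    Letters that do not occur contribute nothing (0 * log 0 is taken as 0)."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())


def alignment_entropy(rows):
    """Sum of the column entropies of an alignment given as equal-length rows."""
    return sum(column_entropy(col) for col in zip(*rows))


rows = ["AAA", "ACC", "ACG", "ACT"]

if __name__ == "__main__":
    for index, col in enumerate(zip(*rows), start=1):
        print("column", index, round(column_entropy(col), 3))
    print("alignment", round(alignment_entropy(rows), 3))
```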


Pros and Cons of Entropy Scores

• The entropy scores are probabilistic.

• They don’t take into account the fact that the sequences are related by a phylogenetic tree. This can be “fixed” by weighting the sequences so that sequences from close species are downweighted relative to sequences from distant species.


Multiple sequence alignment to profile: HMMs

• Hidden Markov models (HMMs) consist of “states” that describe the probability of having a particular amino acid residue at each column of a multiple sequence alignment

• HMMs are probabilistic models

• An HMM gives more sensitive alignments than traditional techniques such as progressive alignments

Simple Hidden Markov Model

Observation: YNNNYYNNNYN

(Y=goes out, N=doesn’t go out)

What is underlying reality (the hidden state chain)?

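[Figure: two hidden states, R (rain) and S (sun), connected by arrows labeled with the transition probabilities 0.15, 0.85, 0.2 and 0.8]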

P(dog goes out in rain) = 0.1

P(dog goes out in sun) = 0.85
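The question of the underlying state chain is usually answered with the Viterbi algorithm. The sketch below is hedged: the two emission probabilities come from the slide, but which transition probability belongs to which arrow cannot be recovered from the transcript, so the transition table and the uniform start distribution are assumptions chosen only to make the example run.

```python
STATES = ("R", "S")  # R = rain, S = sun

START = {"R": 0.5, "S": 0.5}               # assumption: not given on the slide
TRANS = {"R": {"R": 0.8, "S": 0.2},        # assumption: arrow assignment
         "S": {"R": 0.15, "S": 0.85}}      # assumption: arrow assignment
EMIT = {"R": {"Y": 0.1, "N": 0.9},         # P(dog goes out in rain) = 0.1
        "S": {"Y": 0.85, "N": 0.15}}       # P(dog goes out in sun) = 0.85


def viterbi(observations):
    """Return the most probable hidden state path for a string of Y/N."""
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        best = {
            s: max(
                ((p * TRANS[prev][s] * EMIT[s][obs], path + [s])
                 for prev, (p, path) in best.items()),
                key=lambda item: item[0],
            )
            for s in STATES
        }
    return max(best.values(), key=lambda item: item[0])[1]


if __name__ == "__main__":
    print("".join(viterbi("YNNNYYNNNYN")))
```

With different transition assumptions the decoded path can change, which is the point of the slide: the weather is never observed directly, only inferred from the dog's behavior.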

GTWYA (hs RBP)
GLWYA (mus RBP)
GRWYE (apoD)
GTWYE (E Coli)
GEWFS (MUP4)

An HMM is constructed from a MSA

Example: five lipocalins

GTWYA
GLWYA
GRWYE
GTWYE
GEWFS

Prob.   1     2     3     4     5
p(G)   1.0
p(T)         0.4
p(L)         0.2
p(R)         0.2
p(E)         0.2               0.4
p(W)               1.0
p(Y)                     0.8
p(F)                     0.2
p(A)                           0.4
p(S)                           0.2


P(GEWYE) = (1.0)(0.2)(1.0)(0.8)(0.4) = 0.064

log odds score = ln(1.0) + ln(0.2) + ln(1.0) + ln(0.8) + ln(0.4) = -2.75
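The calculation above can be reproduced directly from the five aligned motifs. This is a minimal sketch of a position-specific probability table without insert or delete states or pseudocounts; the function names are illustrative.

```python
from collections import Counter
from math import log, prod

motifs = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]


def build_profile(seqs):
    """Per-column residue frequencies from equal-length aligned sequences."""
    n = len(seqs)
    return [{res: count / n for res, count in Counter(col).items()}
            for col in zip(*seqs)]


def sequence_probability(profile, query):
    """Product of the per-position probabilities; unseen residues give 0."""
    return prod(col.get(res, 0.0) for col, res in zip(profile, query))


def log_score(profile, query):
    """Sum of natural logs of the per-position probabilities.
    Raises KeyError for residues never seen at a position."""
    return sum(log(col[res]) for col, res in zip(profile, query))


if __name__ == "__main__":
    profile = build_profile(motifs)
    print(round(sequence_probability(profile, "GEWYE"), 3))
    print(round(log_score(profile, "GEWYE"), 2))
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long sequences, which is one reason scores are reported on a log scale.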


Position 1: G:1.0
Position 2: T:0.4, L:0.2, R:0.2, E:0.2
Position 3: W:1.0
Position 4: Y:0.8, F:0.2
Position 5: E:0.4, A:0.4, S:0.2

Structure of a hidden Markov model (HMM)

main state

insert state

delete state


From MSA to Profile

• Profile HMMs are important because they provide a powerful way to search databases for distantly related homologs.

• HMMs can be created using the HMMER program.

HMMER: search an HMM against GenBank

Scores for complete sequences (score includes all domains):
Sequence                           Description      Score   E-value   N
--------                           -----------      -----   -------  ---
gi|20888903|ref|XP_129259.1|       (XM_129259) ret  461.1  1.9e-133   1
gi|132407|sp|P04916|RETB_RAT       Plasma retinol-  458.0  1.7e-132   1
gi|20548126|ref|XP_005907.5|       (XM_005907) sim  454.9  1.4e-131   1
gi|5803139|ref|NP_006735.1|        (NM_006744) ret  454.6  1.7e-131   1
gi|20141667|sp|P02753|RETB_HUMAN   Plasma retinol-  451.1  1.9e-130   1
.
.
gi|16767588|ref|NP_463203.1|       (NC_003197) out  318.2   1.9e-90   1

gi|5803139|ref|NP_006735.1|: domain 1 of 1, from 1 to 195: score 454.6, E = 1.7e-131
                   *->mkwVMkLLLLaALagvfgaAErdAfsvgkCrvpsPPRGfrVkeNFDv
                      mkwV++LLLLaA  +  +aAErd      Crv+s    frVkeNFD+
  gi|5803139     1    MKWVWALLLLAA--W--AAAERD------CRVSS----FRVKENFDK 33

                      erylGtWYeIaKkDprFErGLllqdkItAeySleEhGsMsataeGrirVL
                      +r++GtWY++aKkDp  E GL+lqd+I+Ae+S++E+G+Msata+Gr+r+L
  gi|5803139    34    ARFSGTWYAMAKKDP--E-GLFLQDNIVAEFSVDETGQMSATAKGRVRLL 80

                      eNkelcADkvGTvtqiEGeasevfLtadPaklklKyaGvaSflqpGfddy
                      +N+++cAD+vGT+t++E          dPak+k+Ky+GvaSflq+G+dd+
  gi|5803139    81    NNWDVCADMVGTFTDTE----------DPAKFKMKYWGVASFLQKGNDDH 120

Two kinds of multiple sequence alignment resources

[1] Databases of multiple sequence alignments

Text-based or query-based searches: CDD, Pfam (profile HMMs), PROSITE

[2] Multiple sequence alignment programs

Muscle, ClustalW, ClustalX

Page 329

Databases of multiple sequence alignments

• BLOCKS
• CDD, Pfam, SMART (these use HMMs)
• DOMO (Gapped MSA)
• INTERPRO
• iProClass
• MetaFAM
• PRINTS
• PRODOM (PSI-BLAST)
• PROSITE


Multiple sequence alignment programs

• AMAS
• CINEMA
• ClustalW
• ClustalX
• DIALIGN
• HMMT
• Match-Box
• MultAlin
• MSA
• Musca
• PileUp
• SAGA
• T-COFFEE

Multiple sequence alignment algorithms

[Figure: multiple sequence alignment algorithms classified as progressive or iterative and as local or global; programs shown: PIMA, DIALIGN, SAGA, CLUSTAL, PileUp, other]

Performance of alignment programs depends on (McClure et al., 1994):

• the number of sequences

• the degree of similarity between sequences

• the number of insertions in the alignment

• the length of the sequences

• the existence of large insertions and N/C-terminal extensions

• over-representation of some members of the protein family

Strategy for assessment of alternative multiple sequence alignment algorithms

• [1] Create or obtain a database of protein sequences for which the 3D structure is known. Thus we can define “true” homologs using structural criteria.

• [2] Try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers).

• [3] Compare the answers.

BAliBASE: A benchmark alignments database for the evaluation of multiple sequence alignment programs

• BAliBASE is a database of manually-refined multiple sequence alignments specifically designed for the evaluation and comparison of multiple sequence alignment programs. The alignments are categorised by sequence length, similarity, and presence of insertions and N/C- terminal extensions. Core blocks are identified excluding non-superposable regions.

• http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/

BAliBASE

• Thompson et al., 1999, Nuc. Acids. Res. 27, 2682-2690.

• DIALIGN was found to be the best method for local multiple alignment.

• CLUSTAL W, PRRP and SAGA were superior on globally related sequence sets

Conclusions: assessment of alternative multiple sequence alignment algorithms

• [1] As percent identity among proteins drops, performance (accuracy) declines also. This is especially severe for proteins < 25% identity.
– Proteins <25% identity: 65% of residues align well
– Proteins <40% identity: 80% of residues align well

Conclusions: assessment of alternative multiple sequence alignment algorithms

• [2] “Orphan” sequences are highly divergent members of a family. Surprisingly, orphans do not disrupt alignments. Also surprisingly, global alignment algorithms outperform local.

Conclusions: assessment of alternative multiple sequence alignment algorithms

• [3] Separate multiple sequence alignments can be combined (e.g. RBPs and lactoglobulins).
– Iterative algorithms (PRRP, SAGA) outperform progressive alignments (ClustalX)

Conclusions: assessment of alternative multiple sequence alignment algorithms

• [4] When proteins have large N-terminal or C-terminal extensions, local alignment algorithms are superior. PileUp (global) is an exception.