10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this...


Multiple sequence alignment


Copyright notice

• Many of the images in this power point presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

• Many slides in this PowerPoint presentation are from slides by Dr. Jonathan Pevsner and others. The copyright belongs to the original authors. Thanks!


Multiple sequence alignment: definition

• a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned

• Homologous residues are aligned in columns across the length of the sequences

• residues are homologous in an evolutionary sense

• residues are homologous in a structural sense


Multiple sequence alignment: properties

• not necessarily one “correct” alignment of a protein family

• protein sequences evolve...

• ...the corresponding three-dimensional structures of proteins also evolve

• may be impossible to identify amino acid residues that align properly (structurally) throughout a multiple sequence alignment

• for two proteins sharing 30% amino acid identity, about 50% of the individual amino acids are superposable in the two structures


Multiple sequence alignment: features

• some aligned residues, such as cysteines that form disulfide bridges, may be highly conserved

• there may be conserved motifs such as a transmembrane domain

• there may be conserved secondary structure features

• there may be regions with consistent patterns of insertions or deletions (indels)


Multiple sequence alignment: uses

• MSA is more sensitive than pairwise alignment for detecting homologs

• BLAST output can take the form of a MSA, and can reveal conserved residues or motifs

• Population data can be analyzed in a MSA (PopSet)

• A single query can be searched against a database of MSAs

• Regulatory regions of genes may have consensus sequences identifiable by MSA


Multiple Sequence Alignment: Approaches

• Optimal global alignments: dynamic programming

• Global progressive alignments: match closely related sequences first using a guide tree (Feng & Doolittle)

• Global iterative alignments: multiple rebuilding attempts to find the best alignment

• Local alignments: profiles, blocks, patterns


Dynamic Programming

• Generalization of Needleman-Wunsch
– Find the alignment that maximizes a score function

• Computationally expensive: time grows as the product of the sequence lengths
– 2 sequences: O(n^2)
– 3 sequences: O(n^3)
– 4 sequences: O(n^4)
– N sequences: O(n^N)

• Can align about 7 relatively short (200-300 residues) protein sequences in a reasonable amount of time; not much beyond that
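To make the growth concrete, the sketch below counts the cells of the dynamic programming lattice for several sequence counts. It assumes the simplest cost model, one cell per combination of positions, i.e. (n+1)^N cells for N sequences of length n; the function name `dp_cells` is illustrative and the per-cell work (which also grows with N) is ignored.

```python
def dp_cells(n: int, num_seqs: int) -> int:
    """Cells in the DP lattice for num_seqs sequences, each of length n."""
    return (n + 1) ** num_seqs

# Compare lattice sizes for sequences of 250 residues.
for k in (2, 3, 4, 7):
    print(f"{k} sequences of length 250: {dp_cells(250, k):.3e} cells")
```

Even under this optimistic model, each added sequence multiplies the lattice size by roughly the sequence length, which is why exact DP stops being practical after a handful of sequences.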


Progressive Alignment

• Find succession of pairwise alignments

• Heuristic – cannot separate scoring and optimization

• Works well for closely related sequences

• Very sensitive to initial alignments


Progressive Alignment

• Use a series of pairwise alignments to align larger and larger groups of sequences, following the branching order in the guide tree

– Align the two most closely related sequences first, then add the next most closely related sequence, iteratively

– The full DP algorithm is used each time two existing alignments or sequences are aligned

– Gaps in existing (older) alignments remain fixed


Progressive Alignment Examples

• Feng-Doolittle (1987)

• ClustalW

• T-Coffee


Feng-Doolittle MSA occurs in 3 stages

• [1] Do a set of global pairwise alignments (Needleman and Wunsch)

• [2] Create a guide tree

• [3] Progressively align the sequences


Progressive MSA stage 1 of 3: generate global pairwise alignments

[Figure: pairwise alignment scores for five distantly related lipocalins, with the best score marked]


Progressive MSA stage 1 of 3: generate global pairwise alignments

Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96

Five closely related lipocalins; the best score is 99, for sequences 2 and 3.


Number of pairwise alignments needed

For N sequences, (N-1)(N)/2

For 5 sequences, (4)(5)/2 = 10
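The count above is simply the number of unordered pairs, which can be checked by enumerating them. This is a minimal sketch; the helper name `pairwise_pairs` and the 1-based sequence numbering (chosen to match the Clustal W log) are illustrative choices.

```python
from itertools import combinations

def pairwise_pairs(n_seqs):
    """All unordered pairs (i, j) of sequence numbers, numbered from 1."""
    return list(combinations(range(1, n_seqs + 1), 2))

pairs = pairwise_pairs(5)
print(len(pairs))   # number of pairwise alignments for 5 sequences
print(pairs)        # (1, 2), (1, 3), ..., (4, 5)
```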


Feng-Doolittle stage 2: guide tree

• Convert similarity scores to distance scores

• A tree shows the distance between objects

• ClustalW provides a syntax to describe the tree

• A guide tree is not a phylogenetic tree


Guide Tree

• UPGMA: Unweighted Pair Group Method with Arithmetic Mean
– Simplest method of tree construction
– Assumes equal rates of mutation along the branches

• UPGMA algorithm
– Definition: a node in the tree is called an Operational Taxonomic Unit (OTU)
– From the distance matrix, cluster the pair of OTUs with the smallest distance, and calculate the new distances to the merged cluster
– Repeat the previous step until all OTUs have been merged into a single cluster


Guide Tree - UPGMA

• Cluster pair with smallest distance

• Recalculate distance matrix

     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8


Guide Tree - UPGMA

• Calculate new distance using composite OTU (A,B):
– The distance between a simple OTU and a composite OTU is the average of the distances between the simple OTU and the constituent simple OTUs of the composite OTU

dist (A,B),C = (dist A,C + dist B,C) / 2 = (4 + 4) / 2 = 4
dist (A,B),D = (dist A,D + dist B,D) / 2 = (6 + 6) / 2 = 6
dist (A,B),E = (dist A,E + dist B,E) / 2 = (6 + 6) / 2 = 6
dist (A,B),F = (dist A,F + dist B,F) / 2 = (8 + 8) / 2 = 8


Guide Tree - UPGMA

• Distance matrix after merging A and B into the composite OTU (A,B):

      A,B  C   D   E
C      4
D      6   6
E      6   6   4
F      8   8   8   8


Guide Tree - UPGMA

• Second Iteration

      A,B  C   D   E
C      4
D      6   6
E      6   6   4
F      8   8   8   8


Guide Tree - UPGMA

• Third Iteration

      A,B  C   D,E
C      4
D,E    6   6
F      8   8   8


Guide Tree - UPGMA

• Fourth Iteration

      AB,C  D,E
D,E    6
F      8     8


Guide Tree - UPGMA

• Fifth Iteration

      ABC,DE
F      8
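The five iterations can be reproduced with a short script. This is a minimal sketch rather than a reference implementation: it recomputes each cluster-to-cluster distance as the mean over all leaf pairs (the UPGMA definition) instead of updating a matrix in place, and the names `LEAF_DIST`, `cluster_dist` and `upgma` are illustrative. Ties (here (A,B),C and D,E both at distance 4) are broken by enumeration order, so equal-distance merges may occur in a different order than on the slides while giving the same tree.

```python
from itertools import combinations

# Leaf-to-leaf distances from the six-OTU example matrix.
LEAF_DIST = {
    ("A", "B"): 2,
    ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
    ("A", "E"): 6, ("B", "E"): 6, ("C", "E"): 6, ("D", "E"): 4,
    ("A", "F"): 8, ("B", "F"): 8, ("C", "F"): 8, ("D", "F"): 8, ("E", "F"): 8,
}

def leaf_dist(x, y):
    """Symmetric lookup of the distance between two leaves."""
    return LEAF_DIST[(x, y)] if (x, y) in LEAF_DIST else LEAF_DIST[(y, x)]

def cluster_dist(c1, c2):
    """Mean of all leaf-to-leaf distances between two clusters (tuples of leaves)."""
    return sum(leaf_dist(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

def upgma(leaves):
    """Return the list of merges as (cluster1, cluster2, distance)."""
    clusters = [(leaf,) for leaf in leaves]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters; ties go to the first pair enumerated.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: cluster_dist(*pair))
        merges.append((c1, c2, cluster_dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

for c1, c2, dist in upgma("ABCDEF"):
    print("".join(c1), "+", "".join(c2), "at distance", dist)
```

The merge distances come out as 2, 4, 4, 6 and 8, matching the smallest entry in each iteration's matrix.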


Guide Tree

• ClustalW uses Neighbor-Joining

• Neighbor Joining corrects the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch of a tree.

• Neighbor Joining has given the best results in simulation studies and it is the most computationally efficient of the distance algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989))

• Neighbor-Joining Algorithm
– Allows unequal rates of mutation along each branch
– Find pairs of OTUs that minimize total branch length at each stage of clustering, starting with a starlike tree (Minimum-Evolution Tree). The distance matrix is adjusted for differences in the rate of evolution of each taxon (branch).


NJ Algorithm

Neighbor Joining to Calculate the Guide Tree Phase:
– does not require a uniform molecular clock
– the raw data are provided as a distance matrix
– the initial tree is a star tree
– the distance matrix is modified
   • the distance between each pair of nodes is adjusted on the basis of their average divergence from all other nodes
– the least-distant pair of nodes is linked


NJ Algorithm

Neighbor Joining to Calculate the Guide Tree Phase:
– When two nodes are linked:
   • add their common ancestral node to the tree
   • delete the terminal nodes with their branches
   • the common ancestor is now a terminal node on a smaller tree
– At each step, two terminal nodes are replaced by one new node
– The process is complete when there are only two nodes separated by a single branch


NJ Algorithm

• Advantages of Neighbor Joining
– Fast
   • Can be used on large datasets
   • Can support bootstrap analysis
– Can handle lineages with largely different branch lengths (different molecular evolutionary rates)
– Can be used with methods that use correction for multiple substitutions


NJ Algorithm

• Disadvantages of Neighbor Joining
– Sequence information is reduced
   • Sequences are boiled down to distances
   • No secondary or tertiary features used
– Gives only one possible tree
– Strongly dependent on the model of evolution used


NJ Algorithm

• NJ example from: http://www.icp.ucl.ac.be/~opperd/private/neighbor.html

• Consider the following tree:

• Notice that the branches for D and B are longer.

• This expresses the idea that they have a faster molecular clock than the other OTUs.


NJ Algorithm

The distance matrix for the tree is:

     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

Normally, we create the tree from the distances.

In this example, we use the tree to derive the distances.


NJ Algorithm

• We start with a star tree.
• Notice that we have 6 operational taxonomic units (OTUs).
• The star tree has a leaf for each OTU.

[Figure: star tree with leaves A, B, C, D, E and F]


NJ Algorithm

Step 1: Calculate the net divergence for each OTU. The net divergence r(i) is the sum of distances from i to all other OTUs:

$r(i) = \sum_{j=1,\, j \neq i}^{N} d(ij)$

     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

r(A) = 5+4+7+6+8 = 30
r(B) = 42
r(C) = 32
r(D) = 38
r(E) = 34
r(F) = 44


NJ Algorithm

Step 2: Calculate a new distance matrix based on average divergence:

M(ij) = d(ij) - [r(i) + r(j)]/(N-2)

Example: A,B, with d(AB) = 5, r(A) = 30, r(B) = 42 and N = 6:

M(AB) = d(AB) - [r(A) + r(B)]/(N-2) = 5 - (30 + 42)/4 = -13


NJ Algorithm

Step 2, continued: M(ij) = d(ij) - [r(i) + r(j)]/(N-2)

Distance matrix:

     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

Average divergence matrix:

       A      B      C      D      E
B   -13.0
C   -11.5  -11.5
D   -10.0  -10.0  -10.5
E   -10.0  -10.0  -10.5  -13.0
F   -10.5  -10.5  -11.0  -11.5  -11.5


NJ Algorithm

Step 3: Choose the two OTUs for which M(ij) is the smallest.
– the possible choices are A,B and D,E
– arbitrarily choose A and B
– form a new node called U, the parent of A and B
– calculate the branch lengths from U to A and B

S(AU) = d(AB)/2 + [r(A) - r(B)]/[2(N-2)] = 1
S(BU) = d(AB) - S(AU) = 4


NJ Algorithm

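• The tree after U is added.

[Figure: A and B now join the rest of the tree through the new node U, with branch lengths A-U = 1 and B-U = 4; C, D, E and F are unchanged]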


NJ Algorithm

Step 4: Define the distances from U to the other terminal nodes:
– d(CU) = [d(AC) + d(BC) - d(AB)] / 2 = 3
– d(DU) = [d(AD) + d(BD) - d(AB)] / 2 = 6
– d(EU) = [d(AE) + d(BE) - d(AB)] / 2 = 5
– d(FU) = [d(AF) + d(BF) - d(AB)] / 2 = 7
– Note: no change in the pairwise distances among {C,D,E,F}

     U   C   D   E
C    3
D    6   7
E    5   6   5
F    7   8   9   8


NJ Algorithm

• Now N = N-1 = 5
• Repeat steps 1 through 4
• Stop when N = 2
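One pass through steps 1 to 4 can be checked numerically. The sketch below is an illustration under stated assumptions, not a full neighbor-joining implementation: it performs a single join of A and B (the arbitrary choice made on the slides) and all function and variable names (`build_distances`, `net_divergence`, `adjusted_matrix`, `join`) are made up for this example.

```python
LABELS = ["A", "B", "C", "D", "E", "F"]

# Lower-triangular distance matrix from the slides (row -> distances to A, B, ...).
LOWER = {
    "B": [5],
    "C": [4, 7],
    "D": [7, 10, 7],
    "E": [6, 9, 6, 5],
    "F": [8, 11, 8, 9, 8],
}

def build_distances(labels, lower):
    """Return a symmetric lookup keyed by frozenset({i, j})."""
    return {
        frozenset((row, col)): value
        for row, values in lower.items()
        for col, value in zip(labels, values)
    }

def net_divergence(labels, d):
    """Step 1: r(i) = sum of distances from i to every other OTU."""
    return {i: sum(d[frozenset((i, j))] for j in labels if j != i) for i in labels}

def adjusted_matrix(labels, d, r):
    """Step 2: M(ij) = d(ij) - [r(i) + r(j)] / (N - 2)."""
    n = len(labels)
    return {
        (i, j): d[frozenset((i, j))] - (r[i] + r[j]) / (n - 2)
        for pos, i in enumerate(labels)
        for j in labels[pos + 1:]
    }

def join(labels, d, r, i, j, new="U"):
    """Steps 3 and 4: branch lengths to the new node and the reduced matrix."""
    n = len(labels)
    dij = d[frozenset((i, j))]
    s_i = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
    s_j = dij - s_i
    rest = [k for k in labels if k not in (i, j)]
    # Distances among the untouched OTUs are carried over unchanged.
    new_d = {key: value for key, value in d.items() if i not in key and j not in key}
    for k in rest:
        new_d[frozenset((new, k))] = (d[frozenset((i, k))] + d[frozenset((j, k))] - dij) / 2
    return [new] + rest, new_d, s_i, s_j

d = build_distances(LABELS, LOWER)
r = net_divergence(LABELS, d)
m = adjusted_matrix(LABELS, d, r)
labels2, d2, s_au, s_bu = join(LABELS, d, r, "A", "B")

print(r)                                             # net divergences
print(m[("A", "B")], m[("D", "E")])                  # the two smallest M values
print(s_au, s_bu)                                    # branch lengths A-U and B-U
print({k: d2[frozenset(("U", k))] for k in "CDEF"})  # distances from U
```

Repeating the same four calls on `labels2` and `d2` (with a fresh node name each time) would continue the procedure until only two nodes remain.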


Progressive MSA stage 2 of 3: generate a guide tree calculated from the distance matrix



Progressive MSA stage 2 of 3: generate guide tree

((gi|5803139|ref|NP_006735.1|:0.04284,(gi|6174963|sp|Q00724|RETB_MOUS:0.00075,gi|132407|sp|P04916|RETB_RAT:0.00423):0.10542):0.01900,gi|89271|pir||A39486:0.01924,gi|132403|sp|P18902|RETB_BOVIN:0.01902);

five closely related lipocalins
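The parenthesized string above is the guide tree in Newick-style notation: nested parentheses give the branching order and each `:number` is a branch length. As a rough illustration, the sketch below pulls out the leaf names and their branch lengths with a regular expression. It is not a general Newick parser; the helper name `leaf_branch_lengths` is made up, and it relies on the leaf names here containing no parentheses, commas, colons or semicolons.

```python
import re

# Guide tree for the five closely related lipocalins, copied from the slide.
NEWICK = (
    "((gi|5803139|ref|NP_006735.1|:0.04284,"
    "(gi|6174963|sp|Q00724|RETB_MOUS:0.00075,"
    "gi|132407|sp|P04916|RETB_RAT:0.00423):0.10542):0.01900,"
    "gi|89271|pir||A39486:0.01924,"
    "gi|132403|sp|P18902|RETB_BOVIN:0.01902);"
)

def leaf_branch_lengths(newick):
    """Map each leaf name to the length of the branch leading to it."""
    # A leaf is a run of name characters followed by ':' and a number;
    # internal branch lengths follow ')' and therefore do not match.
    pairs = re.findall(r"([^(),:;]+):([0-9.]+)", newick)
    return {name: float(length) for name, length in pairs}

for name, length in leaf_branch_lengths(NEWICK).items():
    print(f"{name:32s}{length}")
```

The mouse and rat sequences sit on the shortest branches, consistent with their being the closest pair in the tree.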



Feng-Doolittle stage 3: progressive alignment

• Make a MSA based on the order in the guide tree

• Start with the two most closely related sequences

• Then add the next closest sequence

• Continue until all sequences are added to the MSA

• Rule: “once a gap, always a gap.”


Use Clustal W to do a progressive MSA

http://www2.ebi.ac.uk/clustalw/


Progressive MSA stage 3 of 3: progressively align the sequences following the branch order of the tree


Clustal W alignment of 5 closely related lipocalins

CLUSTAL W (1.82) multiple sequence alignment

gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN  ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|     MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS  MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                  ********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|     EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS  EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                                *********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                                ****************:*******:****:*:* ****** *********


Why “once a gap, always a gap”?

• There are many possible ways to make a MSA

• Where gaps are added is a critical question

• Gaps are often added to the first two (closest) sequences

• To change the initial gap choices later on would be to give more weight to distantly related sequences

• To maintain the initial gap choices is to trust that those gaps are most believable


Progressive Alignment: Discussion

• Strengths:
– Speed
– Progression is biologically sensible (aligns using a tree)

• Weaknesses:
– No objective function
– No way of quantifying whether or not the alignment is good
– Local minimum problem
– Any errors in the initial alignment are carried through; there is no way to correct an early mistake
– More efficient for closely related sequences than for divergent sequences


Iterative Methods for Multiple Sequence Alignment

• Seeks to increase the MSA score by randomly altering the alignment.

• Usually used to refine an alignment

• Attempts to correct initial alignment problems by repeatedly aligning subgroups of the sequences and then aligning these subgroups into a global alignment of all the sequences
– Start with a multiple sequence alignment.
– Refine it.
– Repeat until one MSA doesn’t change significantly from the next.


MultAlign

• Pairwise scores recalculated during progressive alignment

• Tree is recalculated

• Alignment is refined


PRRP

• Initial pairwise alignment predicts tree

• Tree produces weights

• Locally aligned regions considered to produce new alignment and tree

• Continue until alignments converge


DIALIGN

• Pairs of sequences aligned to locate ungapped aligned regions

• Diagonals of various lengths identified

• Collection of weighted diagonals provide alignment


SAGA: Genetic Algorithms

• Generate many different MSAs by rearrangements that simulate gaps and recombination events

• SAGA (Sequence Alignment by Genetic Algorithm) is one approach


Simulated Annealing

• Obtain a higher-scoring multiple alignment

• Rearranges the current alignment using a probabilistic approach to identify changes that increase the alignment score

• MSASA: Multiple Sequence Alignment by Simulated Annealing


MUSCLE: next-generation progressive MSA

[1] Build a draft progressive alignment

– Determine pairwise similarity through k-mer counting (not by alignment)

– Compute the (triangular) distance matrix

– Construct a tree using UPGMA

– Construct a draft progressive alignment following the tree
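To give a feel for similarity "by k-mer counting (not by alignment)", here is a small sketch that compares two sequences by the fraction of k-mers they share. This is an assumption-laden illustration: it uses a plain Jaccard index over 3-mers, which is not the exact measure MUSCLE uses, and the function names are made up. The two test sequences are fragments taken from the Clustal W lipocalin alignment shown earlier.

```python
def kmers(seq, k=3):
    """Set of all overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_similarity(a, b, k=3):
    """Jaccard index of the two k-mer sets (illustrative, not MUSCLE's formula)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0  # sequences shorter than k share no k-mers
    return len(ka & kb) / len(union)

seq1 = "ERDCRVSSFRVKENFDKARFSGTWYAMAKKDP"  # fragment of gi|89271 (from the alignment)
seq2 = "ERDCRVSSFRVKENFDKARFSGLWYAIAKKDP"  # fragment of RETB_MOUS (from the alignment)

similarity = kmer_similarity(seq1, seq2)
print(f"similarity = {similarity:.3f}, distance = {1 - similarity:.3f}")
```

Because no alignment is computed, this kind of comparison is cheap enough to run on every pair of sequences before the tree is built.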


MUSCLE: next-generation progressive MSA

[2] Improve the progressive alignment

– Compute pairwise identity through the current MSA

– Construct a new tree with Kimura distance measures

– Compare the new and old trees: if improved, repeat this step; if not improved, then we’re done


MUSCLE: next-generation progressive MSA

[3] Refinement of the MSA

– Split the tree in half by deleting one edge

– Make profiles of each half of the tree

– Re-align the profiles

– Accept/reject the new alignment


MUSCLE output (formatted with SeaView)

SeaView is a graphical multiple sequence alignment editor available at http://pbil.univ-lyon1.fr/software/seaview.html


Scoring Multiple Alignments

• Because we can’t see the ancestral sequences, it is often impossible to ever know what is the “correct” multiple alignment. (Since some residues may not be structurally superposable, there may not be a correct alignment.)

• The best we can do is to define a “scoring function” for evaluating the “goodness” of a multiple alignment.

• We then try to find the multiple alignment that maximizes this function.

• This is entirely analogous to the scoring function used in pairwise alignment algorithms.


Scoring Function Features

• The key difference between multiple alignments and pairwise alignments is the fact that different pairs of sequences are separated by different evolutionary distances.

• Any set of sequences we wish to align is related by a phylogenetic tree.

• Ideally, our scoring system should model molecular sequence evolution.


Ideal Scoring Function

• Sequences are related by an evolutionary tree.

• Assume a probabilistic model of molecular evolution.

• Multiple alignment score, S, is

$S = \sum_{X} \Pr(\text{Tree} \mid \text{Root} = X)\,\Pr(X)$

[Figure: evolutionary tree with nodes labeled A, B, C, D and Root]


Ideal Function is Too Complex

• In most cases, we don’t have nearly enough information to model evolution accurately enough.

• The probability depends on knowing the length of each branch in the tree accurately.

• Evolution is not constant at each column in the alignment since selective pressure is stronger on critical residues.


Scoring Function Features cont’d

• As with pairwise alignments, the scoring function should take the chemical/physical properties of residues into account.


Simple Score Functions

• If we assume that the columns of the alignment are independent, the scoring function can be written as a sum of column scores plus a gap score:

$S(m) = G + \sum_{i} S(m_i)$

where $m_i$ is column i of the alignment and G is a function for scoring gaps.


Sum of Pairs: SP Scores

• Using BLOSUM62 matrix, gap penalty -8

• In column 1, we have the pairs (-,S), (-,S), (S,S)

• k(k-1)/2 pairs per column

-  I  K
S  I  K
S  S  E

Column 1 score: -8 - 8 + 4 = -12
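The column score can be reproduced with a few lines of code. This is a minimal sketch: the dictionary holds only the handful of BLOSUM62 entries needed for this three-sequence toy alignment, the names `pair_score` and `column_score` are illustrative, and scoring a gap-gap pair as 0 is an assumption (the slide does not say how such pairs are treated).

```python
from itertools import combinations

GAP = "-"
GAP_PENALTY = -8

# Only the BLOSUM62 entries needed for this toy alignment (keys sorted alphabetically).
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("I", "I"): 4, ("K", "K"): 5,
    ("I", "S"): -2, ("E", "K"): 1,
}

def pair_score(a, b):
    """Score one pair of symbols from the same column."""
    if a == GAP and b == GAP:
        return 0  # assumption: two gaps facing each other are not penalized
    if a == GAP or b == GAP:
        return GAP_PENALTY
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]

def column_score(column):
    """Sum of scores over all k(k-1)/2 pairs in a column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

alignment = ["-IK", "SIK", "SSE"]
columns = list(zip(*alignment))          # transpose rows into columns
scores = [column_score(col) for col in columns]
print(scores[0])      # column 1: -8 - 8 + 4
print(sum(scores))    # SP score of the whole alignment
```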


Problems with Sum of Pairs Scores

SP scores are very commonly used, but they have problems:

• They have no probabilistic justification.

• The relative difference in score between the correct and incorrect alignment decreases as the evidence increases—this is counter-intuitive.


Minimum Entropy Scores

• This is a probabilistic (well, information theoretic) way of saying how “pure” or “good” an alignment column is.

• Intuition: good alignment columns will contain very few different letters

• Method: We convert the alignment column into a probability vector and compute the entropy of the vector.


Entropy

• Entropy is a very useful concept from Information Theory.

• If X is a random variable that can have values X1, X2, …, Xk, the entropy of X is defined as:

$H(X) = -\sum_{j} \Pr(X_j) \log \Pr(X_j)$

• The maximum entropy is log k, when the distribution is uniform, e.g., Pr(X) = (¼, ¼, ¼, ¼).

• The minimum entropy is 0, when the distribution puts all its weight on one letter, e.g., Pr(X) = (0, 0, 1, 0).


Entropy

• Define frequencies for the occurrence of each letter in each column of the multiple alignment
– pA = 1, pT = pG = pC = 0 (1st column)
– pA = 0.75, pT = 0.25, pG = pC = 0 (2nd column)
– pA = 0.50, pT = 0.25, pC = 0.25, pG = 0 (3rd column)

• Compute the entropy of each column:

$-\sum_{X \in \{A,T,G,C\}} p_X \log p_X$

A A A
A A A
A A T
A T C


Entropy: Example

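Best case: a column containing A, A, A, A has entropy 0.

Worst case: a column containing A, T, G, C has entropy

$-\sum \frac{1}{4} \log_2 \frac{1}{4} = -4 \left( \frac{1}{4} \cdot (-2) \right) = 2$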


Multiple Alignment: Entropy Score

Entropy for a multiple alignment is the sum of entropies of its columns:

$-\sum_{\text{over all columns}} \; \sum_{X \in \{A,T,G,C\}} p_X \log p_X$

Entropy of an Alignment: Example

column entropy: -( pA log pA + pC log pC + pG log pG + pT log pT ), with log taken to base 2 and 0*log0 taken to be 0

• Column 1 = -[1*log(1) + 0*log0 + 0*log0 + 0*log0] = 0

• Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log0 + 0*log0] = -[ (1/4)*(-2) + (3/4)*(-.415) ] = +0.811

• Column 3 = -[(1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4)] = 4* -[(1/4)*(-2)] = +2.0

• Alignment Entropy = 0 + 0.811 + 2.0 = +2.811

A A A

A C C

A C G

A C T
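The arithmetic in this example can be reproduced with a short script. This is a sketch under the same assumptions as the slide (base-2 logarithms, letters absent from a column contribute 0); the function names are illustrative, not taken from any alignment package.

```python
import math
from collections import Counter

def column_entropy(column):
    """Entropy (bits) of one alignment column, from observed letter frequencies."""
    counts = Counter(column)
    n = len(column)
    # Only letters that actually occur are counted, so 0 * log 0 never arises.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def alignment_entropy(rows):
    """Sum of column entropies; zip(*rows) turns rows into columns."""
    return sum(column_entropy(col) for col in zip(*rows))

rows = ["AAA", "ACC", "ACG", "ACT"]  # the four-sequence alignment above
for i, col in enumerate(zip(*rows), start=1):
    print(f"column {i}: {column_entropy(col):.3f}")
print(f"alignment entropy: {alignment_entropy(rows):.3f}")
```

A lower total means purer columns, so under this score the better of two candidate alignments is the one with the smaller entropy.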

Pros and Cons of Entropy Scores

• The entropy scores are probabilistic.

• They don’t take into account the fact that the sequences are related by a phylogenetic tree. This can be “fixed” by weighting the sequences so that sequences from close species are downweighted relative to sequences from distant species.
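One way to picture the weighting fix is to replace raw letter counts with summed sequence weights before computing the entropy. The sketch below assumes the weights are already given; how to derive them from a tree is not covered here, and the weight values shown are hypothetical.

```python
import math
from collections import defaultdict

def weighted_column_entropy(column, weights):
    """Entropy (bits) of a column where each sequence contributes its weight."""
    totals = defaultdict(float)
    for letter, weight in zip(column, weights):
        totals[letter] += weight
    total = sum(totals.values())
    return sum(-(w / total) * math.log2(w / total)
               for w in totals.values() if w > 0)

column = ["A", "A", "A", "T"]

# Equal weights reproduce the unweighted entropy of the column.
print(weighted_column_entropy(column, [1.0, 1.0, 1.0, 1.0]))

# Hypothetical weights: the three A sequences come from close species and
# are downweighted so that together they count as much as the single T.
print(weighted_column_entropy(column, [1 / 3, 1 / 3, 1 / 3, 1.0]))
```

With the downweighting, the column looks like an even A/T split rather than a strong A majority, which is the intended effect when the three A sequences are near-duplicates.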

Multiple sequence alignment to profile: HMMs

• Hidden Markov models (HMMs) consist of “states” that describe the probability of finding a particular amino acid residue at a given column of a multiple sequence alignment

• HMMs are probabilistic models

• An HMM gives more sensitive alignments than traditional techniques such as progressive alignments

Simple Hidden Markov Model

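Observation: YNNNYYNNNYN

(Y = dog goes out, N = dog doesn’t go out)

What is the underlying reality (the hidden state chain)?

[Figure: two hidden states, R (rain) and S (sun), with transition probabilities 0.15, 0.85, 0.2 and 0.8 on the arrows between them]

P(dog goes out in rain) = 0.1

P(dog goes out in sun) = 0.85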
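The question of the hidden state chain is usually answered with the Viterbi algorithm, which finds the single most probable sequence of states. The sketch below is illustrative only. The emission probabilities come from the slide, but the slide text does not say which arrow carries which transition probability, so the assignment used here (R stays R with 0.85, S stays S with 0.8) and the uniform starting distribution are assumptions.

```python
import math

STATES = ("R", "S")

# Emission probabilities from the slide (Y = goes out, N = does not go out).
EMIT = {"R": {"Y": 0.1, "N": 0.9}, "S": {"Y": 0.85, "N": 0.15}}

# ASSUMPTION: which arrow carries which number is not recoverable from the text.
TRANS = {"R": {"R": 0.85, "S": 0.15}, "S": {"S": 0.8, "R": 0.2}}

# ASSUMPTION: both states are equally likely at the start.
START = {"R": 0.5, "S": 0.5}

def viterbi(observations):
    """Most probable hidden state path for a non-empty string of Y/N observations."""
    # best[s] is the log-probability of the best path so far that ends in state s.
    best = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
            for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][obs]))
        backpointers.append(pointers)
        best = new_best
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("YNNNYYNNNYN"))
```

Because staying in a state is far more likely than switching under these assumed numbers, an isolated Y or N may be explained as an unlikely emission rather than as a change in the weather.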

An HMM is constructed from an MSA

Example: five lipocalins

GTWYA (hs RBP)
GLWYA (mus RBP)
GRWYE (apoD)
GTWYE (E. coli)
GEWFS (MUP4)

GTWYA
GLWYA
GRWYE
GTWYE
GEWFS

Prob.    1     2     3     4     5
p(G)    1.0
p(T)          0.4
p(L)          0.2
p(R)          0.2
p(E)          0.2               0.4
p(W)                1.0
p(Y)                      0.8
p(F)                      0.2
p(A)                            0.4
p(S)                            0.2

P(GEWYE) = (1.0)(0.2)(1.0)(0.8)(0.4) = 0.064

log odds score = ln(1.0) + ln(0.2) + ln(1.0) + ln(0.8) + ln(0.4) = -2.75
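The probability table and the score for GEWYE can be reproduced directly from the five lipocalin fragments. This is a minimal sketch: it uses raw column frequencies with no pseudocounts and no background model, as the slide does, and the function names are illustrative.

```python
import math
from collections import Counter

MSA = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]

def build_profile(msa):
    """One dict per column, mapping residue -> observed frequency."""
    n = len(msa)
    return [{res: count / n for res, count in Counter(col).items()}
            for col in zip(*msa)]

def sequence_probability(profile, seq):
    """Product of the per-column probabilities of the residues in seq."""
    p = 1.0
    for column, residue in zip(profile, seq):
        p *= column.get(residue, 0.0)
    return p

def log_score(profile, seq):
    """Sum of natural logs of the per-column probabilities."""
    total = 0.0
    for column, residue in zip(profile, seq):
        p = column.get(residue, 0.0)
        if p == 0.0:
            return float("-inf")  # residue never observed in this column
        total += math.log(p)
    return total

profile = build_profile(MSA)
print(round(sequence_probability(profile, "GEWYE"), 3))
print(round(log_score(profile, "GEWYE"), 2))
```

Working with the sum of logs avoids multiplying many small probabilities together, which matters once the profile is hundreds of columns long.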

The same profile, written as the emission probabilities of five consecutive states:

State 1: G:1.0
State 2: T:0.4 L:0.2 R:0.2 E:0.2
State 3: W:1.0
State 4: Y:0.8 F:0.2
State 5: E:0.4 A:0.4 S:0.2

Structure of a hidden Markov model (HMM)

[Figure: profile HMM architecture showing main states, insert states, and delete states]

From MSA to Profile

• Profile HMMs are important because they provide a powerful way to search databases for distantly related homologs.

• HMMs can be created using the HMMER program.

HMMER: search an HMM against GenBank

Scores for complete sequences (score includes all domains):

Sequence                          Description      Score    E-value    N
--------                          -----------      -----    -------  ---
gi|20888903|ref|XP_129259.1|      (XM_129259) ret  461.1  1.9e-133    1
gi|132407|sp|P04916|RETB_RAT      Plasma retinol-  458.0  1.7e-132    1
gi|20548126|ref|XP_005907.5|      (XM_005907) sim  454.9  1.4e-131    1
gi|5803139|ref|NP_006735.1|       (NM_006744) ret  454.6  1.7e-131    1
gi|20141667|sp|P02753|RETB_HUMAN  Plasma retinol-  451.1  1.9e-130    1
.
.
gi|16767588|ref|NP_463203.1|      (NC_003197) out  318.2   1.9e-90    1

gi|5803139|ref|NP_006735.1|: domain 1 of 1, from 1 to 195: score 454.6, E = 1.7e-131

              *->mkwVMkLLLLaALagvfgaAErdAfsvgkCrvpsPPRGfrVkeNFDv
                 mkwV++LLLLaA  +  +aAErd      Crv+s    frVkeNFD+
gi|5803139     1 MKWVWALLLLAA--W--AAAERD------CRVSS----FRVKENFDK 33

                 erylGtWYeIaKkDprFErGLllqdkItAeySleEhGsMsataeGrirVL
                 +r++GtWY++aKkDp  E GL+lqd+I+Ae+S++E+G+Msata+Gr+r+L
gi|5803139    34 ARFSGTWYAMAKKDP--E-GLFLQDNIVAEFSVDETGQMSATAKGRVRLL 80

                 eNkelcADkvGTvtqiEGeasevfLtadPaklklKyaGvaSflqpGfddy
                 +N+++cAD+vGT+t++E          dPak+k+Ky+GvaSflq+G+dd+
gi|5803139    81 NNWDVCADMVGTFTDTE----------DPAKFKMKYWGVASFLQKGNDDH 120

Two kinds of multiple sequence alignment resources

[1] Databases of multiple sequence alignments

Text-based or query-based searches: CDD, Pfam (profile HMMs), PROSITE

[2] Multiple sequence alignment programs

Muscle, ClustalW, ClustalX

Page 329

Databases of multiple sequence alignments

• BLOCKS
• CDD, Pfam, SMART (these use HMMs)
• DOMO (Gapped MSA)
• INTERPRO
• iProClass
• MetaFAM
• PRINTS
• PRODOM (PSI-BLAST)
• PROSITE

Multiple sequence alignment programs

• AMAS
• CINEMA
• ClustalW
• ClustalX
• DIALIGN
• HMMT
• Match-Box
• MultAlin
• MSA
• Musca
• PileUp
• SAGA
• T-COFFEE

Multiple sequence alignment algorithms

              Local      Global
Progressive   PIMA       CLUSTAL, PileUp, other
Iterative     DIALIGN    SAGA

Performance of alignment programs depends on (McClure et al., 1994):

• the number of sequences

• the degree of similarity between sequences

• the number of insertions in the alignment

• the length of the sequences

• the existence of large insertions and N/C-terminal extensions

• over-representation of some members of the protein family

Strategy for assessment of alternative multiple sequence alignment algorithms

• [1] Create or obtain a database of protein sequences for which the 3D structure is known. Thus we can define “true” homologs using structural criteria.

• [2] Try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers).

• [3] Compare the answers.
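Step [3] leaves open how the answers are compared. One common choice, assumed here rather than taken from the slides, is a sum-of-pairs score: the fraction of residue pairs aligned in the structure-based reference that are also aligned in the test alignment. The function names and the toy alignments are illustrative.

```python
def aligned_pairs(alignment):
    """Set of residue pairs that share a column.

    A residue is identified by (sequence index, position in the ungapped
    sequence), so the same residue is recognised in different alignments.
    """
    pairs = set()
    positions = [0] * len(alignment)
    for column in zip(*alignment):
        residues = []
        for seq_idx, ch in enumerate(column):
            if ch != "-":
                residues.append((seq_idx, positions[seq_idx]))
                positions[seq_idx] += 1
        for i in range(len(residues)):
            for j in range(i + 1, len(residues)):
                pairs.add((residues[i], residues[j]))
    return pairs

def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)

reference = ["ACG-T", "ACGGT"]  # toy structure-based alignment
test = ["AC-GT", "ACGGT"]       # toy program output, gap placed one column early

print(sum_of_pairs_score(test, reference))
```

A score of 1.0 means every residue pair in the reference was reproduced; misplaced gaps lower the score in proportion to the pairs they break.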

BAliBASE: A benchmark alignments database for the evaluation of multiple sequence alignment programs

• BAliBASE is a database of manually-refined multiple sequence alignments specifically designed for the evaluation and comparison of multiple sequence alignment programs. The alignments are categorised by sequence length, similarity, and presence of insertions and N/C-terminal extensions. Core blocks are identified excluding non-superposable regions.

• http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/

BAliBASE

• (Thompson et al., 1999, Nucleic Acids Res. 27, 2682-2690)

• DIALIGN was found to be the best method for local multiple alignment.

• CLUSTAL W, PRRP and SAGA were superior on globally related sequence sets.

Conclusions: assessment of alternative multiple sequence alignment algorithms

• [1] As percent identity among proteins drops, performance (accuracy) declines also. This is especially severe for proteins < 25% identity.

– Proteins <25% identity: 65% of residues align well

– Proteins <40% identity: 80% of residues align well
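Percent identity between two aligned sequences can be computed in several ways, differing mainly in the denominator. The sketch below assumes one common convention: identical residues divided by the number of columns in which both sequences have a residue. The function name is illustrative.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity of two aligned sequences of equal length.

    ASSUMPTION: columns with a gap ('-') in either sequence are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

# Two of the lipocalin fragments used earlier in these slides.
print(percent_identity("GTWYA", "GEWFS"))
```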

Conclusions: assessment of alternative multiple sequence alignment algorithms

• [2] “Orphan” sequences are highly divergent members of a family. Surprisingly, orphans do not disrupt alignments. Also surprisingly, global alignment algorithms outperform local.

Conclusions: assessment of alternative multiple sequence alignment algorithms

• [3] Separate multiple sequence alignments can be combined (e.g. RBPs and lactoglobulins).

– Iterative algorithms (PRRP, SAGA) outperform progressive alignments (ClustalX)

Conclusions: assessment of alternative multiple sequence alignment algorithms

• [4] When proteins have large N-terminal or C-terminal extensions, local alignment algorithms are superior. PileUp (global) is an exception.