
Multiple Sequence Alignment

With thanks to Eric Stone and Steffen Heber, North Carolina State University

Definition: Multiple sequence alignment

Alignment:
ATTTG-
ATTTGC
AT-TGC

No alignment (sequences differ in length):
ATTTG
ATTTGC
ATT-GC

No alignment (one column contains only gaps):
ATTT-G-
ATTT-GC
AT-T-GC

• Given a set of sequences, a multiple sequence alignment is an assignment of gap characters such that

– the resulting sequences have the same length

– no column contains only gaps
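The two conditions in the definition can be checked mechanically. The following is a minimal sketch, assuming an alignment is given as a list of gapped strings with "-" as the gap character; the function name `is_valid_msa` is hypothetical and not part of the slides.

```python
def is_valid_msa(rows):
    """Return True if rows satisfy both conditions of the MSA definition."""
    if not rows:
        return False
    length = len(rows[0])
    # Condition 1: all gapped sequences have the same length.
    if any(len(row) != length for row in rows):
        return False
    # Condition 2: no column consists only of gap characters.
    return all(any(ch != "-" for ch in column) for column in zip(*rows))


print(is_valid_msa(["ATTTG-", "ATTTGC", "AT-TGC"]))     # True
print(is_valid_msa(["ATTTG", "ATTTGC", "ATT-GC"]))      # False: unequal lengths
print(is_valid_msa(["ATTT-G-", "ATTT-GC", "AT-T-GC"]))  # False: all-gap column
```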

Application: Characterize protein families

Application: Discover conserved pattern

• A faint similarity between two sequences may become detectable if it is present in many sequences

Application: Recover phylogenetic tree

Pig   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
Dog   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
Human KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
Mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

Multiple sequence alignment (MSA)

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS

• Generalizes pairwise sequence alignment (PSA)

– Multiple simply means three or more sequences

• In PSA, paired residues assumed to be homologous

– In MSA, columns of residues assumed to be homologous

• Are there fundamental differences between MSA and PSA?

– What distinguishes MSA from PSA?

Biological distinction

• Homology (common ancestry) in PSA is symmetric

• Phylogeny renders homology in MSA asymmetric

[Figure: in a PSA of …TTACG… and …TTGCG…, the A/G difference is symmetric; in an MSA of …TTACG…, …TTACG…, …TTGCG…, …TTGCG… related by a tree, A → G ≠ G → A]

MSA makes a statement about homology

• Like pairwise sequence alignment

– Multiple sequence alignment asserts the homology of its columns

• Unlike pairwise sequence alignment, interpretation requires a phylogeny

NYLS NFLS

Statistical distinction

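[Figure: a residue pair a, b under the match model M and under the random model R]

Odds ratio: $\dfrac{\Pr(a, b \mid M)}{\Pr(a, b \mid R)}$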

• PSA compares a "match model" M to a "random model" R

• In MSA, how do we define M without phylogeny?

• How were PSA match probabilities $p_{ab}$ obtained?

– Can something similar be done for MSA?

Computational distinction

• Even for PSA, exhaustive search is exponential

– But optimal PSA can be found by DP in O(mn)

• Clearly MSA has higher complexity than PSA

– But by how much?

• Is finding the optimal MSA still feasible?

[Figure: DP cell F(i, j) computed from its neighbors F(i-1, j-1), F(i-1, j), and F(i, j-1), with match score s(xi, yj)]

The problem of global MSA

• Given k sequences: x1, x2, …, xk

• Insert gaps (-) in each sequence xi such that

– All sequences have the same length L

– Score of the global map is maximum

• MSA is more sensitive than PSA

– A faint similarity between two sequences may become detectable if present in many

• More sequences may increase alignment quality

– But the cost is added complexity

How can alignment columns be scored?

• In pairwise sequence alignment

– Scores quantify the exchangeability of the residues/gap in the pair

• In multiple sequence alignment

– A similar treatment is more complicated and requires use of phylogeny

• One solution:

– Evaluate MSA through its constituent PSAs

• These PSAs are called induced pairwise alignments

Induced pairwise alignments

• Example MSA:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

• Induces three PSAs:

x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

• The MSA can be scored by summing over the induced PSAs

– This is called the “Sum-of-pairs” approach
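An induced pairwise alignment is obtained by keeping two rows of the MSA and dropping every column in which both rows hold a gap. A minimal sketch, assuming rows are gapped strings; the helper name `induced_pairwise` is hypothetical.

```python
def induced_pairwise(msa, i, j):
    """Project an MSA onto rows i and j, dropping columns that are gap/gap."""
    columns = [(a, b) for a, b in zip(msa[i], msa[j])
               if not (a == "-" and b == "-")]
    row_i = "".join(a for a, _ in columns)
    row_j = "".join(b for _, b in columns)
    return row_i, row_j


# Example MSA from the slide (rows x, y, z).
msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(induced_pairwise(msa, 0, 1))  # ('ACGCGG-C', 'ACGC-GAG')
print(induced_pairwise(msa, 0, 2))  # ('AC-GCGG-C', 'GCCGC-GAG')
print(induced_pairwise(msa, 1, 2))  # ('AC-GCGAG', 'GCCGCGAG')
```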

Example: Sum-of-pairs score

Scoring matrix (BLOSUM 60), gap penalty: -8

     F   Y   G   D
F    5  -2  -2  -1
Y        7   1  -5
G            4  -3

MSA:

x: F-G
y: F-G
z: FYD

Induced PSAs and their scores:

x: FG     y: FG      5 + 4 = 9
x: F-G    z: FYD     5 - 8 - 3 = -6
y: F-G    z: FYD     5 - 8 - 3 = -6

• Sum-of-pairs score: 9 - 6 - 6 = -3

– What is the computational complexity?
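The arithmetic of this example can be reproduced with a short script. This is a sketch under stated assumptions: the substitution scores and the gap penalty are copied from the slide, a gap/gap pair is scored 0 (equivalent to dropping that column from the induced PSA), and the names `SCORES`, `pair_score`, and `sum_of_pairs` are hypothetical.

```python
from itertools import combinations

# Substitution scores and gap penalty as shown on the slide.
SCORES = {
    ("F", "F"): 5, ("F", "Y"): -2, ("F", "G"): -2, ("F", "D"): -1,
    ("Y", "Y"): 7, ("Y", "G"): 1, ("Y", "D"): -5,
    ("G", "G"): 4, ("G", "D"): -3,
}
GAP = -8


def pair_score(a, b):
    """Score one pair of symbols from a column."""
    if a == "-" and b == "-":
        return 0  # assumption: gap/gap columns vanish from the induced PSA
    if a == "-" or b == "-":
        return GAP
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def sum_of_pairs(msa):
    """Sum the scores of all induced pairwise alignments."""
    return sum(pair_score(a, b)
               for row1, row2 in combinations(msa, 2)
               for a, b in zip(row1, row2))


print(sum_of_pairs(["F-G", "F-G", "FYD"]))  # -3
```

The loops visit k(k-1)/2 row pairs and L columns per pair, so scoring a given MSA this way takes time proportional to k²L.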

Problem: Finding the optimal MSA

• Given k sequences: x1, x2,…, xk

• Sum-of-pairs score for any MSA of k sequences is

– Sum of scores of all k(k-1)/2 induced PSAs

• Seek the MSA A that maximizes the sum-of-pairs score

$S(A) = \sum_{i<j} S(A_{ij})$

where $S(A_{ij})$ is the score of $A_{ij}$, the PSA of sequences $x_i$ and $x_j$ induced by the MSA A

• Clearly exhaustive search is not an option

– Can we rely on dynamic programming?

Dynamic Programming

• Similar to pairwise alignments, multiple sequence alignments can be computed by dynamic programming

Generalized Needleman-Wunsch

• Given 3 sequences x, y, and z

• Main iteration loop:

F(i,j,k) = max{
  F(i-1, j-1, k-1) + S(xi, yj, zk),
  F(i-1, j-1, k  ) + S(xi, yj, - ),
  F(i-1, j  , k-1) + S(xi, - , zk),
  F(i-1, j  , k  ) + S(xi, - , - ),
  F(i  , j-1, k-1) + S(- , yj, zk),
  F(i  , j-1, k  ) + S(- , yj, - ),
  F(i  , j  , k-1) + S(- , - , zk) }
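The recurrence can be sketched directly for three sequences. This is an illustration only: the column score is assumed to be a sum-of-pairs score with toy values (match +1, mismatch -1, gap -2, gap/gap 0), which are not from the slides, and the names `sp_column` and `align3_score` are hypothetical. Only the optimal score is computed, with no traceback.

```python
from itertools import product


def sp_column(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column (toy values; gap/gap pairs score 0)."""
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        if p == "-" or q == "-":
            total += gap
        else:
            total += match if p == q else mismatch
    return total


def align3_score(x, y, z):
    """Optimal score of a global alignment of x, y, z under sp_column."""
    n, m, l = len(x), len(y), len(z)
    neg_inf = float("-inf")
    F = [[[neg_inf] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    F[0][0][0] = 0
    # The 2^3 - 1 = 7 predecessor moves: each sequence either advances or gaps.
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    for i, j, k in product(range(n + 1), range(m + 1), range(l + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = x[pi] if di else "-"
            b = y[pj] if dj else "-"
            c = z[pk] if dk else "-"
            candidate = F[pi][pj][pk] + sp_column(a, b, c)
            F[i][j][k] = max(F[i][j][k], candidate)
    return F[n][m][l]


print(align3_score("AC", "AC", "AC"))  # 6
```

The table has (n+1)(m+1)(l+1) cells and each cell inspects 7 neighbors, which is the three-sequence case of the general cost analysed next.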

Analysis of algorithm

• Given k sequences of length n:

– Space for matrix: $O(n^k)$

– Neighbors/cell: $2^k - 1$

– Time to compute SP score: $O(k^2)$

– Overall runtime: $O(k^2 2^k n^k)$

• Implications

– Can align about 7 relatively short (length 200 - 300) sequences in a reasonable amount of time

• $2^7 \cdot 200^7$ > 1,600,000,000,000,000,000

– Exact optimality is generally not attainable
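The figure quoted above can be checked with integer arithmetic. A small sketch; the function name `exact_msa_cost` is hypothetical, and the result is an operation count up to constant factors, not a running time.

```python
def exact_msa_cost(k, n):
    """Rough operation count k^2 * 2^k * n^k for exact DP on k sequences."""
    return k**2 * 2**k * n**k


cells_times_neighbors = 2**7 * 200**7
print(f"{cells_times_neighbors:,}")      # 1,638,400,000,000,000,000
print(f"{exact_msa_cost(7, 200):,}")     # includes the k^2 factor for SP scoring
```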

Heuristics for multiple sequence alignment

• Exact optimality is too slow, even by dynamic programming

• Alternative:

– Seek good suboptimal solutions that are attainable in reasonable time

• Key questions:

– What is a “good” suboptimal solution?

– What is “reasonable” time?

• Heuristics focus on an intelligent reduction of search space

– Divide-and-conquer alignment

– Greedy alignment (progressive)

Divide-and-conquer alignment (DCA)

• Idea: Reduce search space for dynamic programming by cutting the sequences.

• Algorithm:

1. Cut sequences into fragments until fragments can be aligned by DP

2. Build multiple alignments by DP

3. Concatenate the resulting alignments

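[Figure: sequences cut into prefixes and suffixes; one cut gives a reduction of search space, further cuts give a greater reduction of search space]

Cut points optimize: $C = S_{\text{prefix}} + S_{\text{suffix}} - S_{\text{complete}}$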

Progressive alignment

• Idea:

– Build multiple sequence alignment from a series of pairwise alignments

• Strategy:

– Choose two sequences to align (optimally)

– Hold pairwise alignment fixed, treat as a new sequence, and iterate

• For n sequences:

– Requires n - 1 pairwise sequence alignments

• Does the order matter?

– What criteria are used to choose the sequences?

Guide tree

• Binary tree

– Leaves correspond to sequences

– Internal nodes represent alignments

– Root corresponds to final MSA

• The guide tree specifies the order of alignment

• Usually constructed from matrix of pairwise distances between sequences

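[Figure: guide tree with leaf sequences such as ATC, ATG, TCG; internal nodes hold pairwise alignments (ATC/ATG and TCG/TCC); the root holds the final MSA]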

Simple approach to distance matrix D

• Example sequences:

A ACGCGTTGGGCGATGGCAAC

B ACGCGTTGGGCGACGGTAAT

C ACGCATTGAATGATGATAAT

D ACACATTGAGTGATAATAAT

• Simple approach

– Count mismatches

• Pairwise distance matrix:
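Counting mismatches for the four example sequences can be sketched as follows. The sequences are taken from the slide; the names `mismatch_distance` and `D` are hypothetical, and the approach assumes the sequences are already of equal length and compared position by position.

```python
from itertools import combinations


def mismatch_distance(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


seqs = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}

# Upper triangle of the distance matrix, keyed by sequence-name pairs.
D = {(p, q): mismatch_distance(seqs[p], seqs[q])
     for p, q in combinations(seqs, 2)}

for (p, q), d in D.items():
    print(p, q, d)
# A B 3
# A C 7
# A D 8
# B C 6
# B D 7
# C D 3
```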

From pairwise distances to a tree

• Using this information, a tree can be drawn:

A ACGCGTTGGGCGATGGCAAC

B ACGCGTTGGGCGACGGTAAT

C ACGCATTGAATGATGATAAT

D ACACATTGAGTGATAATAAT

• Is it guaranteed that the distances exactly fit a tree?

Guide tree

Progressive alignment

• Follow branching order of guide tree to build MSA

• Problem: We may have to align

– Two sequences

– A sequence and an alignment

– Two alignments

Guide Tree

How to align two alignments?

Idea: Dynamic programming; treat columns like single positions.

Example:

a = GTCGTA

b = GTTGTT

[Figure: the three DP choices for the last column: align a[3] and b[3]; align a[3] and gap; align b[3] and gap]

1 PEEKSAVTAL
2 GEEKAAVLAL
3 PADKTNVKAA
4 AADKTNVKAA

5 EGEWGLVLHV
6 AAEKTKIRSA

Score: $\frac{1}{8}\left[\, s(T,V) + s(T,I) + s(L,V) + s(L,I) + 2s(K,V) + 2s(K,I) \,\right]$

Scoring an alignment of alignments

• Average over all possibilities, possibly weighted

– Generalizes PSA of two sequences to two profiles
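Averaging over all possibilities means scoring every residue pair formed by one symbol from each column and dividing by the number of pairs. A minimal unweighted sketch; the names `pair_counts`, `column_pair_score`, and `toy_score` are hypothetical, and the 1/0 identity score is a stand-in for a real substitution matrix.

```python
from collections import Counter


def pair_counts(col_a, col_b):
    """Count how often each residue pair occurs between two columns."""
    return Counter((p, q) for p in col_a for q in col_b)


def column_pair_score(col_a, col_b, score):
    """Unweighted average of score(p, q) over all cross-column pairs."""
    counts = pair_counts(col_a, col_b)
    total = sum(n * score(p, q) for (p, q), n in counts.items())
    return total / sum(counts.values())


def toy_score(p, q):
    """Placeholder scoring: 1 for identical residues, 0 otherwise."""
    return 1 if p == q else 0


# Column T, L, K, K from sequences 1-4 against column V, I from sequences 5-6.
print(pair_counts("TLKK", "VI"))
print(column_pair_score("AAK", "AK", toy_score))  # 0.5
```

For the columns T, L, K, K and V, I the counter holds eight pairs, with (K, V) and (K, I) each appearing twice, which matches the coefficients in the score formula.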

Complexity of progressive alignment

The time required to align k sequences of length n:

• For progressive alignment: $O(k^2 n^2)$

• Compare with dynamic programming: $O(k^2 2^k n^k)$

• BUT

– Is there any guarantee on the quality of the progressive MSA?

Progressive alignments with ClustalW

• Clustal is the most popular method of MSA

ClustalW: Overview

[Figure: ClustalW pipeline for sequences 1 to 5: pairwise alignments → distance matrix → guide tree → progressive alignment]

1. Compute pairwise alignments (DP)

2. Convert similarities into distances

3. Build guide tree from distances by Neighbor Joining

4. Align with respect to guide tree

ClustalW server at EBI

http://www2.ebi.ac.uk/clustalw/

Input: Sequences in FASTA format

• ClustalW example:

– RBP protein sequences from five species:

– Human, mouse, rat, cow, pig

Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96

Input: Five RBP sequences

best score is alignment between rat and mouse

Generate all 10 PSAs

• Highest scoring alignment ⇒ closest sequences on guide tree

((gi|5803139|ref|NP_006735.1|:0.04284,(gi|6174963|sp|Q00724|RETB_MOUS:0.00075,gi|132407|sp|P04916|RETB_RAT:0.00423):0.10542):0.01900,gi|89271|pir||A39486:0.01924,gi|132403|sp|P18902|RETB_BOVIN:0.01902);

Use PSA scores to create guide tree

Rat RBP

Mouse RBP

Pig RBP

Cow RBP

Human RBP


Progressively align sequences

• Make a MSA based on the order in the guide tree

– Start with the two most closely related sequences

– Then add the next closest sequence

– Continue until all sequences are added to the MSA

• Rule: “once a gap, always a gap.”

Progressive MSA

ClustalW: “once a gap, always a gap”

Closely related:

x: ACGCGGC
y: ACGC-GC

Distantly related:

x: ACGCGGC
y: ACTT-TC

• There are many possible ways to make a MSA

– Where gaps are added is a critical question

• In which case are gap locations most reliable?

• Gaps often added to the first two sequences

– To maintain the initial gap choices is to trust that those gaps are most believable

CLUSTAL W (1.82) multiple sequence alignment

gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN  ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|     MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS  MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50

********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|     EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS  EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100

*********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150

****************:*******:****:*:* ****** *********

* asterisks indicate identity in a column

Output: MSA of 5 RBP sequences

ClustalW has sophisticated gap treatment

• Gap opening and extension penalties depend on:

– Weight matrix

– Sequence similarity, length, difference in sequence length

– Position of gaps and residues at gaps

• Motivation:

– If the positions of secondary structure elements (α-helices, β-strands) are known in all or some of the sequences

– Could increase the gap penalties inside secondary structure elements and decrease them outside

– Forcing gaps to occur most often in loop regions

Sequence weighting in ClustalW

• Choose root such that mean of branch lengths on either side are equal

• For each sequence compute distance from root

• Adjust branches that are used several times

• Use distances as weight factors in SP score

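[Figure: guide tree in which A (branch length 0.2) and B (branch length 0.1) share an internal branch of length 0.3, and C has branch length 0.6]

wA = 0.2 + 0.3/2 = 0.35

wB = 0.1 + 0.3/2 = 0.25

wC = 0.6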
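The weights in this example follow from walking from the root to each leaf and splitting every branch length evenly among the leaves below it. A minimal sketch, assuming a rooted tree written as nested lists of (subtree, branch length) pairs; the tree shape is inferred from the weights above, and the names `leaf_weights` and `tree` are hypothetical.

```python
def leaf_weights(tree):
    """Weight of each leaf: sum over its root path of (branch length / leaves below)."""
    weights = {}

    def leaves(node):
        if isinstance(node, str):
            return [node]
        return [leaf for child, _ in node for leaf in leaves(child)]

    def walk(node, accumulated):
        if isinstance(node, str):
            weights[node] = accumulated
            return
        for child, length in node:
            # A branch used by several leaves is shared equally among them.
            walk(child, accumulated + length / len(leaves(child)))

    walk(tree, 0.0)
    return weights


# A and B hang below an internal branch of length 0.3; C attaches to the root.
tree = [([("A", 0.2), ("B", 0.1)], 0.3), ("C", 0.6)]
for name, w in leaf_weights(tree).items():
    print(name, round(w, 2))
# A 0.35
# B 0.25
# C 0.6
```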

Weighted sum-of-pairs score

• Sequence pairs may be assigned weights to reduce the influence of very similar sequences on the alignment score

• This leads to a weighted sum-of-pairs score (WSP):

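$\mathrm{WSP}(m_i) = \sum_{k<l} w_{kl}\, s(m_i^k, m_i^l)$

where $w_{kl}$ is the weight factor for the sequence pair (k, l) and $m_i^k$ is the symbol of sequence k in column $m_i$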

Additional features of ClustalW

• Individual weights are assigned to sequences

– Closely related ⇒ less weight

• Scoring matrices are varied depending on the presence of conserved or divergent sequences, e.g.

PAM20 80-100% id

PAM60 60-80% id

PAM120 40-60% id

PAM250 0-40% id
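The table above amounts to a lookup from percent identity to a matrix name. A minimal sketch; the function name `choose_pam` is hypothetical, and assigning the shared boundary values (80, 60, 40) to the higher-identity range is an assumption, since the slide's ranges overlap at their endpoints.

```python
def choose_pam(percent_identity):
    """Pick a PAM matrix name from the percent identity of two sequences."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must lie in [0, 100]")
    if percent_identity >= 80:
        return "PAM20"
    if percent_identity >= 60:
        return "PAM60"
    if percent_identity >= 40:
        return "PAM120"
    return "PAM250"


for pid in (95, 70, 50, 20):
    print(pid, choose_pam(pid))
```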

Shortcomings of progressive approaches

• Progressive MSA strongly dependent upon initial alignments

• If sequences aligned at each step are similar

– Progressive approach works well

• If MSA is built on dubious PSAs

– Errors in alignment propagated and amplified

• Post-processing solution:

– Iterative refinement

Dangers of progressive alignment

• Initial alignments are “frozen” even when new evidence is introduced

• Example:

x: GAAGTT
y: GAC-TT   (frozen by initial PSA)

z: GAACTG
w: GTACTG

Additional sequences make clear that y: GA-CTT

Iterative refinement of progressive MSA

[Figure: x and z held fixed (projection); y allowed to vary]

• For each j = 1 to N

– Remove sequence xj and realign it to the remaining alignment of x1, …, xj-1, xj+1, …, xN

• Repeat until alignment converges

Ex: Iterative refinement

• Progressive alignment (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

• After realigning y to the remainder:

x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA
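The gain of three matches can be verified by counting, column by column, how often y agrees with each of the other rows before and after the realignment. A small sketch; the function name `matches_with_rest` is hypothetical, and identity counting stands in for whatever score the refinement step actually maximizes.

```python
def matches_with_rest(msa, name):
    """Count identical non-gap symbols between one row and all other rows."""
    target = msa[name]
    return sum(a == b and a != "-"
               for other, row in msa.items() if other != name
               for a, b in zip(target, row))


before = {"x": "GAAGTTA", "y": "GAC-TTA", "z": "GAACTGA", "w": "GTACTGA"}
after = dict(before, y="G-ACTTA")

print(matches_with_rest(before, "y"))  # 12
print(matches_with_rest(after, "y"))   # 15
```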

Evaluating methods of MSA

• Since DP for MSA is prohibitively slow

– Heuristic methods are required

• Heuristics vary in both speed and accuracy

• Speed is well quantified by computational complexity

– How can accuracy be quantified?

• Idea: Construct benchmark sets of “correct” MSAs

– Evaluate ability of methods to reconstruct “correct” alignment

• Metric: Q = proportion of correctly aligned residues
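One common way to make Q concrete is to count residue pairs: the fraction of residue pairs aligned in the reference MSA that are also aligned in the test MSA. The sketch below assumes that pair-based reading of "correctly aligned residues", which the slide does not spell out; the names `aligned_pairs` and `q_score` are hypothetical.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Set of (seq_a, pos_a, seq_b, pos_b) residue pairs sharing a column."""
    pairs = set()
    next_pos = [0] * len(msa)  # ungapped residue index per sequence
    for column in zip(*msa):
        residues = []
        for s, ch in enumerate(column):
            if ch != "-":
                residues.append((s, next_pos[s]))
                next_pos[s] += 1
        pairs.update((a, i, b, j) for (a, i), (b, j) in combinations(residues, 2))
    return pairs


def q_score(test, reference):
    """Fraction of reference-aligned residue pairs recovered by the test MSA."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test)) / len(ref)


reference = ["AC-T", "ACGT"]
print(q_score(["AC-T", "ACGT"], reference))  # 1.0
print(q_score(["ACT-", "ACGT"], reference))  # 2 of 3 reference pairs recovered
```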

Performance of different alignment tools

Benchmark sets: BAliBASE (237), PREFAB (1932), SABmark (698); columns Q and t per benchmark

Algorithm   Q      t
Align-m     0.352  56:44
DIALIGN     0.410  8:28
CLUSTALW    0.439  2:16
MAFFT       0.442  7:33
T-Coffee    0.456  59:10
MUSCLE      0.464  20:42
PROBCONS    0.505  17:20

[Remaining table cells, row and column assignment not recoverable: 12:25:00, 2:57:00, 2:36:00, 144:51:00, 3:11:00, 0.648, 0.668, 19:41:00]

Popular heuristic methods for MSA

• ClustalW: Most widely used

– http://www.ebi.ac.uk/clustalw/

• T-Coffee: Better but slower

– http://www.ch.embnet.org/software/TCoffee.html

• ProbCons: Most accurate

– http://probcons.stanford.edu/

• MUSCLE: Most scalable

– http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

Summary

• Multiple alignments make a statement about homology

• Optimal solution can be found by DP, but prohibitively slow

• Faster heuristics necessary for most applications

• Most heuristic methods based on progressive method

• Variants include pre-processing and post-processing

• Choice of method dictated by tradeoffs of time and accuracy

Postscript: The chicken and the egg

DGMNAGLAQ-VIA
DGM-ASLAQGVI-
-----SIPGVDK-

[Figure: cycle linking an alignment and a phylogenetic tree: initial alignment → estimate phylogenetic tree → compute sequence weights → refine alignment using weighted SP score]

Alignment ↔ phylogeny ↔ alignment ↔ ...

Pig   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
Dog   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
Human KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
Mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

• Didn’t we use the tree to build the alignment?

– How can we use the alignment to build the tree?
