Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple...

1

Multiple Sequence Alignment

Some slides from: • Jones, Pevzner, USC Intro to Bioinformatics Algorithms

http://www.bioalgorithms.info/ • S. Batzoglu, Stanford http://ai.stanford.edu/~serafim/CS262_2006/

• Geiger, Wexler, Technion http://www.cs.technion.ac.il/~cs236522/

• Ruzzo, Tompa U. Washington CSE 590bi • Poch, Strasbourg www.inra.fr/internet/Projets/agroBI/PHYLO/Poch.ppt

• A. Drummond, Auckland, NZ • Tandy Warnow, Illinois tandy.cs.illinois.edu/Jordan-april14.pptx

CG © Ron Shamir

http://www.tau.ac.il/�

2

Multiple Alignment vs. Pairwise Alignment

• Up until now we have only tried to align two sequences.

• What about more than two? And what for?

• A faint similarity between two sequences becomes significant if present in many

• Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal CG © Ron Shamir


http://www.aldeaeducativa.com/small/venter.jpg�

http://images.google.com/imgres?imgurl=http://show.docjava.com:8086/book/dsp/data/images/gifs/baboon.GIF&imgrefurl=http://show.docjava.com:8086/book/dsp/data/images/gifs/&h=512&w=512&sz=258&tbnid=jKpgkJ-c2OYJ:&tbnh=128&tbnw=128&start=13&prev=/images?q=baboon&hl=en&lr=&sa=N�

http://images.google.com/imgres?imgurl=http://news.bbc.co.uk/olmedia/1765000/images/_1768316_tiger2.jpg&imgrefurl=http://ai-o-artista.blogspot.com/&h=300&w=300&sz=14&tbnid=bjq3xD6A-2AJ:&tbnh=111&tbnw=111&start=6&prev=/images?q=tiger&hl=en&lr=&sa=G�

http://images.google.com/imgres?imgurl=http://www.nature.com/news/2004/041206/images/chicken.jpg&imgrefurl=http://www.nature.com/news/2004/041206/full/041206-8.html&h=166&w=180&sz=25&tbnid=TbBpwwkh5PQJ:&tbnh=88&tbnw=95&start=16&prev=/images?q=chicken&hl=en&lr=&sa=G�

3

Multiple Alignment Definition Input: Sequences S1 , S2 ,…, Sk over the same alphabet Output: Gapped sequences S’1 , S’2 ,…, S’k of equal length

1. |S’1|= |S’2|=…= |S’k|

2. Removal of spaces from S’i gives Si for all i

CG © Ron Shamir


4

Example

S1=AGGTC

S2=GTTCG

S3=TGAAC Possible alignment

A - T

G G G

G - -

T T A

- T A

C C C

- G -

Possible alignment

A G -

G T T

G T G

T - A

- - A

C C A

- G C

CG © Ron Shamir


5

CG © Ron Shamir


6 CG © Ron Shamir


7

Example

Multiple sequence alignment of 7 neuroglobins using clustalx

CG © Ron Shamir


8

Human-centric beta globin Multiple Alignment

http://globin.cse.psu.edu/ CG © Ron Shamir


http://images.google.com/imgres?imgurl=http://blogdosbichos.blogs.sapo.pt/arquivo/galago.gif&imgrefurl=http://blogdosbichos.blogs.sapo.pt/2004/04/&h=320&w=480&sz=97&hl=iw&start=3&tbnid=nfPknvoy2u1asM:&tbnh=86&tbnw=129&prev=/images?q=galago&svnum=10&hl=iw&lr=&rls=GBSA,GBSA:2006-34,GBSA:en&sa=N�

9

Protein Phylogenies – Example

Kinase domain

CG © Ron Shamir


10

Motivation again • Common structure, function or origin may

be only weakly reflected in sequence – multiple comparisons may highlight weak signals

• Major uses: – Identify and represent protein families – Identify and represent conserved seq. elements

(e.g. domains) – Deduce evolutionary history – PCR primer design – Transcription factor binding sites

CG © Ron Shamir


Structure comparison, modelling

Interaction networks

Hierarchical function annotation: homologs, domains, motifs

Phylogenetic studies

Human genetics, SNPs

Therapeutics, drug discovery

Therapeutics, drug design DBD

LBD

insertion domain

binding sites / mutations

Gene identification, validation

RNA sequence, structure, function

Comparative genomics

MACS

MSA : central role in biology

CG © Ron Shamir

12

Scoring alignments

• Given input seqs. S1 , S2 ,…, Sk find a multiple alignment of optimal score

• Scores preview: – Sum of pairs – Consensus – Tree

• Varying methods - no single winner

CG © Ron Shamir


13

Sum of Pairs score Def: Induced pairwise alignment A pairwise alignment induced by the multiple

alignment Example: x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAG; z: GCCGC-GAG; z: GCCGCGAG

S(M) = Σk<l σ(S’k, S’l) CG © Ron Shamir


14

Consider the following alignment:

AC-CDB- -C-ADBD A-BCDAD

SOP Score Example

Scoring scheme: match - 0 mismatch/indel - -1 SP score: -3 -5 -4 =-12

CG © Ron Shamir


15

Alignments = Paths

• Align 3 sequences: ATGC, AATC,ATGC

A A T -- C

A -- T G C

-- A T G C

CG © Ron Shamir


16

Alignment Paths 0 1 1 2 3 4

A A T -- C

A -- T G C

-- A T G C

x coordinate

CG © Ron Shamir


17

Alignment Paths • Align 3 sequences: ATGC, AATC,ATGC 0 1 1 2 3 4

0 1 2 3 3 4

A A T -- C

A -- T G C

-- A T G C

•

x coordinate

y coordinate

CG © Ron Shamir


18

Alignment Paths

0 1 1 2 3 4

0 1 2 3 3 4

A A T -- C

A -- T G C

0 0 1 2 3 4

-- A T G C

• Resulting path in (x,y,z) space: (0,0,0)→(1,1,0)→(1,2,1) →(2,3,2) →(3,3,3) →(4,4,4)

x coordinate

y coordinate

z coordinate

CG © Ron Shamir


19

Aligning Three Sequences • Same strategy as

aligning two sequences

• Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align

• For global alignments, go from source to sink

source

sink

CG © Ron Shamir


20

2-D vs 3-D Alignment Grid

V

W

2-D edit graph

3-D edit graph

CG © Ron Shamir


21

2-D cell versus 2-D Alignment Cell

In 3-D, 7 edges in each unit cube

In 2-D, 3 edges in each unit square

CG © Ron Shamir


22

Architecture of 3-D Alignment Cell

(i-1,j-1,k-1)

(i,j-1,k-1)

(i,j-1,k)

(i-1,j-1,k) (i-1,j,k)

(i,j,k)

(i-1,j,k-1)

(i,j,k-1)

CG © Ron Shamir


23

Multiple Alignment: Dynamic Programming

• si,j,k = max

• δ(x, y, z) is an entry in the 3-D scoring matrix (e.g., using SOP score)

si-1,j-1,k-1 + δ(vi, wj, uk) si-1,j-1,k + δ (vi, wj, _ ) si-1,j,k-1 + δ (vi, _, uk) si,j-1,k-1 + δ (_, wj, uk) si-1,j,k + δ (vi, _ , _) si,j-1,k + δ (_, wj, _) si,j,k-1 + δ (_, _, uk)

cube diagonal: no indels

face diagonal: one indel

edge diagonal: two indels

CG © Ron Shamir


24

Running Time

• For 3 sequences of length n, the run time is O(n3)

• For k sequences, build a k-dimensional cube, with run time O(2knk)

• Want n, k 10’s to 100’s (sometimes much more)

• Impractical CG © Ron Shamir


25

Scoring gaps • Our alg assumed a linear gap score (const.

per space). • For affine gap scores: need 2k – 1 states,

one per combination of gapped/ungapped sequences

• Running time: O(2k × 2k × nk) = O(4k nk)

CG © Ron Shamir


26

NP- Hardness of Multiple Alignment

• For fixed k, the problem is polynomial • The general problem for SOP:

• Wang & Jiang 94: special type matrix • Bonizzoni & Vedova 01: more general

matrix, binary alphabet • Elias 03: binary or larger alphabet, any

matrix • Similar results for other scoring

functions

CG © Ron Shamir


27

Forward Dynamic Programming • An alternative approach to DP for pairwise

(and multiple) alignment: • D(v) – opt value of path sourcev • p(w) – best-yet solution of path sourcew • When D(v) is computed, send its value

forward on all the arcs (v,w): p(w)=min{p(w),D(v)+cost(v,w)}

• Once p(w) has been updated by all incoming edges – that value is optimal; set as D(w)

CG © Ron Shamir


28

Forward Dynamic Programming (2)

• Maintain a queue of nodes whose D is not set yet

• For the node w at the head of the queue: Set D(w)p(w) and remove

• Update p for each out-neighbor x of w – if x is not in the queue – add it at the end*

• Same complexity as the regular (backwards) DP

• *breaking ties by lexicographically!

CG © Ron Shamir


29

Faster DP Algorithm for MultiAlign Carillo-Lipman 88

• Use forward DP. • We’ll demonstrate on three sequences • f12(i,j) = opt pairwise alignment score of

suffixes S1(i,..n1), S2(j,..n2), etc. • Key idea: if ∃ a known soln of cost z, if D(i,j,k)+ f12(i,j) + f13(i,k) +f23(j,k) > z Do not place (i,j,k) in the queue • Extra work: DP for all seq pairs - O(k2n2) • Guarantees opt soln – no improved time

bound, but often saves a lot in practice.

CG © Ron Shamir


30

http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/cl2.gif CG © Ron Shamir


31

Approximation Algorithm – SOP

We use min cost instead of max score

Find alignment of minimal cost

Assumption: the cost function δ is a distance function

• δ(x,x) = 0

• δ(x,y) = δ(y,x) ≥ 0

• δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality) (e.g. cost of MM ≤ cost of two indels)

D(S,T) - cost of minimum global alignment between S and T

CG © Ron Shamir


32

Input: Γ - set of k strings S1, …,Sk.

1. Find the string S*∈ Γ (center) that minimizes 2. Denote S1=S* and the rest of the strings as S2, …,Sk

3. Iteratively add S2, …,Sk to the alignment as follows:

a. Suppose S1, …,Si-1 are already aligned as S’1, …,S’i-1

b. Optimally align Si to S’1 to produce S’i and S’’1 aligned

c. Adjust S’2, …,S’i-1 by adding spaces where spaces were added to S’’1 d. Replace S’1 by S’’1

( ){ }

∑Γ∈ *\

*,SS

SSD

The Center Star algorithm

CG © Ron Shamir


33

• Choosing S1 – execute DP for all sequence-pairs - O(k2n2)

• Adding Si to the alignment - execute DP for Si , S’1 - O(i·n2).

(In the ith stage the length of S’1 can be up-to i· n)

( ) ( )∑−

=

=⋅1

1

222k

inkOniO

Running time

total complexity

CG © Ron Shamir


34

For all i: d(1,i)=D(S1,Si) (we perform optimal alignment between S’1 and Si and δ(-,-) = 0 )

Approximation ratio

• M* - An optimal alignment • M - The alignment produced by this algorithm • d(i,j) - The distance M induces on the pair Si,Sj •

•recall D(S,T) – min cost of alignment between S and T

( ) ( ) ( )∑∑∑<=

≠=

==ji

k

i

k

ijj

jidjidMv ,2,1 1

CG © Ron Shamir


35

Approximation ratio (2)

( )∑=

−=k

llSSDk

21,)1(2

( )∑=

=k

jjSSDk

21,

2)1(2)()(

* ≤−

≤k

kMvMv

( ) ( )∑∑=

≠=

=k

i

k

ijj

jidMv1 1

, ( ) ( )( )∑∑=

≠=

+≤k

i

k

ijj

jdid1 1

,1,1

( )∑=

−=k

lldk

2,1)1(2

( ) ( )∑∑=

≠=

=k

i

k

ijj

jidMv1 1

** , ( ) ≥≥ ∑∑=

≠=

k

i

k

ijj

ji SSD1 1

,

( )∑∑= =

≥k

i

k

jjSSD

1 21 ,

Definition of S1:

( ) ( )∑∑≠==

≤∀k

ijj

ji

k

jj SSDSSDi

121 ,,:

Triangle inequality + symmetry

CG © Ron Shamir


36

Theorem (Gusfield 93)

• We have proved: • The center star algorithm is a

polynomial algorithm that guarantees a solution at most twice the optimum.

• “a 2-approximation” • “an approximation ratio of 2”

CG © Ron Shamir


37

Steiner/consensus string • Input: Γ - set of k strings S1, …,Sk. • D(X,Y) – score of aligning X, Y.

• The consensus error of S relative to Γ:

E(S)= Σi≤k D(S, Si) • Goal: find S* that minimizes E(S) • S* is called an optimal Steiner string for Γ

• New objective – k terms and not k2

• No direct relation to multialign! (for now)

CG © Ron Shamir


38

Thm: Assume D satisfies triangle ineq. Then ∃S∈Γ that guarantees an approximation ratio 2.

∑ ≠=

iSS iSSDSE ),()( ( ) ( )( )∑ ≠+≤

iSS iSSDSSD *,*,

∑ ≠++−=

SS ii

SSDSSDSSDk )*,(*),(*),()2(

*)(*),()2( SESSDk +−=Pick S ∈Γ closest to S* (not constructively)

∑ Γ∈=

iS iSSDSE )*,(*)( *),( SSDk ⋅≥

Pf: Pick S ∈Γ

Approximation

21)2()()(

* <+−

≤k

kSESE

CG © Ron Shamir


40

Consensus string from MA • The consensus character in column i of MA is the char

that minimizes the sum of distances to it from all the chars in column i

• The consensus string of a MA is obtained by taking the consensus character in each position.

• Ex: match =0, mismatch/space=1 consensus=majority char • S*: AC-GC-GAG • x: AC-GCGG-C • y: AC-GC-GAG • z: GCCGA-GAG • u: AC-T-GGCA • v: -CAGT-GAG • w: AC-GC-GAG

Alignment error: S(M) = Σk σ(S’k, S*)

The opt consensus MA: one with least alignment error

Thm: opt consensus MA = Steiner string

(up to spaces)

CG © Ron Shamir


41

Tree MA • Input: Tree T, a string for each leaf • Phylogenetic alignment for T: Assignment of

a string to each internal node

• Score – (weighted) sum of scores along edges • Goal: find phyl. alignment of optimal score • Consensus = phyl. Alignment where T is a star

CTGG

CCGG

GTTC CTTG

GTTG

GTTG

CTGG

CG © Ron Shamir


42

Tree MA – complexity

• Also NP-hard • Poly time approximations:

– 2-approximation – Better approximation with more time

(PTAS)

CG © Ron Shamir


43

Profile Representation of MA - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1 .8 C .6 1 .4 1 .6 .2 G 1 .2 .2 .4 1 T .2 1 .6 .2 - .2 .8 .4 .8 .4

• Alternatively, use log odds: • pi(a) = fraction of a’s in col i • p(a) = fraction of a’s overall • log pi(a)/p(a)

CG © Ron Shamir


45

Aligning a sequence to a profile

• Key in pairwise alignment is scoring two positions x,y: σ(x,y)

• For a letter x and a column y in a profile, σ(x,y)=value of x in col. Y

• Invent a score for σ(x,-) • Run the DP alg for pairwise alignment

CG © Ron Shamir


46

Aligning alignments

• Given two alignments, how can we align them?

• Hint: use DP on the corresponding profiles.

x GGGCACTGCAT y GGTTACGTC-- Alignment 1 z GGGAACTGCAG

w GGACGTACC-- Alignment 2 v GGACCT-----

x GGGCACTGCAT y GGTTACGTC-- z GGGAACTGCAG w GGACGTACC-- v GGACCT-----

CG © Ron Shamir


47

Multiple Alignment: Greedy Heuristic

• Choose most similar pair of sequences and combine into a profile , thereby reducing alignment of k sequences to an alignment of of k-1 sequences/profiles. Repeat

u1= ACGTACGTACGT…

u2 = TTAATTAATTAA…

u3 = ACTACTACTACT…

…

uk = CCGGCCGGCCGG

u1= ACg/tTACg/tTACg/cT…

u2 = TTAATTAATTAA…

…

uk = CCGGCCGGCCGG… k

k-1

CG © Ron Shamir


48

Progressive Alignment • A variation of greedy algorithm with a

somewhat more intelligent strategy for choosing the order of alignments.

CG © Ron Shamir


49

Progressive alignment

Align sequences (pairwise) in some (greedy) order

Decisions (1) Order of alignments (2) Alignment of sequence to group (only), or allow group to group (3) Method of alignment, and scoring function

CG © Ron Shamir


64

ClustalW Thompson, Higgins, Gibson 94

• Popular multiple alignment tool today • ‘W’ = ‘weighted’ (different parts of

alignment are weighted differently). • Three-step process

1.) Construct pairwise alignments 2.) Build Guide Tree 3.) Progressive alignment guided by

the tree

CG © Ron Shamir


65

Step 1: Pairwise Alignment • Aligns each sequence against each

other giving a similarity matrix • Similarity = exact matches / sequence

length (percent identity) v1 v2 v3 v4 v1 - v2 .17 - v3 .87 .28 - v4 .59 .33 .62 -

(.17 means 17 % identical)

CG © Ron Shamir


66

Step 2: Guide Tree • Use the similarity method to create a

Guide Tree by applying some clustering method*

• Guide tree roughly reflects

evolutionary relations

• *ClustalW uses the neighbor-joining method (to be described later in the course)

CG © Ron Shamir


67

Step 2: Guide Tree (cont’d)

v1 v3 v4 v2

Calculate: v1,3 = alignment (v1, v3) v1,3,4 = alignment((v1,3),v4) v1,2,3,4 = alignment((v1,3,4),v2)

v1 v2 v3 v4 v1 - v2 .17 - v3 .87 .28 - v4 .59 .33 .62 -

CG © Ron Shamir


68

Step 3: Progressive Alignment • Start by aligning the two most similar

sequences • Using the guide tree, add in the most similar

pair (seq-seq, seq-prof or prof-prof) • Insert gaps as necessary • Many ad-hoc rules: weighting, different

matrices, special gap scores…. FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD FOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ . . : ** . :.. *:.* * . * **:

Dots and stars show how well-conserved a column is. CG © Ron Shamir


MAFFT – MultiAlign via FFT Katoh et al, NAR 02

• Overview: – Homology identification using FFT – Alignment Scoring/Selection

• Faster computation due to

– O(NlogN) homology detection using FFT – Simpler-to-compute scoring function

69


Signal and pairwise correlation • For amino acid sequences

– [volume v(a), polarity p(a)] per AA a – Each standardized (mean 0, std 1) of all AAs

• Correlation between seqs 1,2 at offset k:

• Direct computation for all k O(N2) for seqs of length ~N. With FFT O(NlogN)

• c(k) = cv(k) + cp(k)

70


Pairwise alignment • High c(k) gives offset, not location

• Use sliding windows to find 20 highest peaks, merging if necessary.

• Result: ≤20 homologous segments

71


Pairwise alignment (2) • Use regular DP to

find highest scoring chain of segments

• Extend to full alignment constrained to contain the chain (within the white

72


Group-group alignment For column n in group1, weight the

sequences to get average values of v, p Apply the same procedure to get group1-

group2 alignment

73


Overall alg • Compute pairwise distances (quickly) • Form guide tree T1 • Progressive alignment of sequences in

bottom up order of T1 • Infer a new guide tree T2 from the

multialignment • Realign each input seq along T2 • …. • x100 faster than prev algs, same

accuracy 74


78

Multiple Alignment: History 1 975 Sankoff Formulated multiple alignment problem

and gave dynamic programming solution 1 988 Carrillo- Lipman Branch and Bound approach for MSA 1 990 Feng- Doolittle Progressive alignment 1 994 Thompson- Higgins- Gibson- ClustalW Most popular multiple alignment program 1 998 Morgenstern et al. - DIALIGN Segment-based multiple alignment 2000 Notredame- Higgins- Heringa- T- coffee Using the library of pairwise alignments 2002 MAFFT 2004 MUSCLE 2005 ProbCons

…… Still a lot to be done! CG © Ron Shamir


Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple...

Documents

Transcript of Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple...