CS 5263 Bioinformatics


Lecture 8: Multiple Sequence Alignment

Roadmap

• Homework?

• Review of last lecture

• Multiple sequence alignment

Homework

• #1: dsDNA => mRNA => protein

[Figure: coding strand and template strand of the dsDNA; template strand → mRNA; mRNA → protein]

The genetic code

Problem #2

• For two strings of lengths m and n, the number of alignments is equal to the number of paths from (0, 0) to (m, n)

– The number of ways to reach (i, j) depends on the number of ways to reach its preceding neighbors
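The counting argument can be sketched as a small dynamic program. This is a minimal illustration, assuming the three preceding neighbors of (i, j) are (i-1, j), (i, j-1), and (i-1, j-1); the function name `count_alignments` is made up for this note.

```python
def count_alignments(m, n):
    """Count lattice paths from (0, 0) to (m, n) using right, down, and diagonal steps."""
    # Cells on the first row or column are reachable in exactly one way.
    paths = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Sum over the three preceding neighbors of (i, j).
            paths[i][j] = paths[i - 1][j] + paths[i][j - 1] + paths[i - 1][j - 1]
    return paths[m][n]


print(count_alignments(1, 1))  # 3
print(count_alignments(2, 2))  # 13
```

The count grows very quickly with m and n, which is why alignments are scored by dynamic programming rather than enumerated.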

Problem #3

• Similar to problem #2

• But there are some limitations on certain paths

– (i-1, j-1)→(i-1, j)→(i, j) is illegal

– So is (i-1, j-1)→(i, j-1)→(i, j)

• How many ways are there to get to (i-1, j) without using (i-1, j-1)→(i-1, j)?

• How many ways are there to get to (i, j-1) without using (i-1, j-1)→(i, j-1)?
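One possible way to handle the restriction is to count paths separately by the type of their last step. The sketch below is only one reading of the problem: it assumes that a step in j may never be followed directly by a step in i, and vice versa. The names `D`, `H`, `V`, and `count_restricted` are made up for this note.

```python
def count_restricted(m, n):
    """Count paths from (0, 0) to (m, n) in which a j-step and an i-step are never adjacent."""
    # D, H, V[i][j]: legal paths ending at (i, j) whose last step was
    # diagonal, horizontal (j-1 -> j), or vertical (i-1 -> i).
    D = [[0] * (n + 1) for _ in range(m + 1)]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    V = [[0] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 1  # the empty path is treated as ending with a diagonal step
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            if i and j:
                # A diagonal step may follow any kind of step.
                D[i][j] = D[i - 1][j - 1] + H[i - 1][j - 1] + V[i - 1][j - 1]
            if j:
                # A horizontal step may not follow a vertical step.
                H[i][j] = D[i][j - 1] + H[i][j - 1]
            if i:
                # A vertical step may not follow a horizontal step.
                V[i][j] = D[i - 1][j] + V[i - 1][j]
    return D[m][n] + H[m][n] + V[m][n]


print(count_restricted(1, 1))  # 1
print(count_restricted(2, 2))  # 3
```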

Problem #4

• Implementation is easy

• Histogram: how you bin it may affect your results

– bin for each discrete value you observed in your scores

• Scores related to base frequency?

• Scores differ between global and local alignments?

• Score distribution?

BLAST

Main idea: Construct a dictionary of all the words in the query

Alignment initiated between words of alignment score at least T

Alignment: Ungapped extensions until score below statistical threshold

Output: All local alignments with score > statistical threshold

[Figure: the query words are scanned against the database (DB)]

BLAST

[Figure: dot matrix with the query A C G A A G T A A G G T C C A G T along one axis and the database sequence along the other]

Example:

k = 4, T = 4

The matching word GGTC initiates an alignment

Extension to the left and right with no gaps until alignment falls < 50%

Output:

GTAAGGTCC

GTTAGGTCC
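The seed-and-extend idea can be sketched in a few lines. This is a rough illustration, not BLAST itself: seeds are exact k-mer matches, and the extension stops with a simple X-drop rule (stop once the running score falls 2 below the best seen), which is a stand-in for the stopping rule on the slide. The database string `DB` is an assumption read off the figure, and the function names are made up for this note.

```python
QUERY = "ACGAAGTAAGGTCCAGT"
DB = "AGCGTTAGGTCCTTCCC"  # assumed database sequence, read off the dot-matrix figure


def find_seeds(query, db, k):
    """Return (word, query_pos, db_pos) for every exact k-mer shared by query and db."""
    words = {}
    for i in range(len(query) - k + 1):
        words.setdefault(query[i:i + k], []).append(i)  # dictionary of query words
    hits = []
    for j in range(len(db) - k + 1):  # scan the database
        for i in words.get(db[j:j + k], []):
            hits.append((db[j:j + k], i, j))
    return hits


def extend_ungapped(query, db, qpos, dpos, k, match=1, mismatch=-1, xdrop=2):
    """Extend a seed left and right without gaps; keep the best-scoring extent."""
    def best_extension(step, q, d):
        best = score = best_len = n = 0
        while 0 <= q < len(query) and 0 <= d < len(db):
            score += match if query[q] == db[d] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score >= xdrop:
                break  # score dropped too far below the best seen so far
            q += step
            d += step
        return best_len

    right = best_extension(+1, qpos + k, dpos + k)
    left = best_extension(-1, qpos - 1, dpos - 1)
    return query[qpos - left:qpos + k + right], db[dpos - left:dpos + k + right]


print(find_seeds(QUERY, DB, 4))
print(extend_ungapped(QUERY, DB, 9, 7, 4))  # ('GTAAGGTCC', 'GTTAGGTCC')
```

With these assumed inputs, the word GGTC (query position 9, database position 7) extends to the same pair of strings shown in the output above.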

Gapped BLAST

[Figure: dot matrix of the query A C G A A G T A A G G T C C A G T against the database sequence, with a band around the anchor]

Added features:

• Pairs of words can initiate alignment

• Extensions with gaps in a band around anchor

Output:

GTAAGGTCCAGT
GTTAGGTC-AGT

• Advantages

– Fast!!!!

– A few minutes to search a database of 10^11 bases

• Disadvantages

– Sensitivity may be low

– Often misses weak homologies

New improvement

• Make it even faster

– But even less sensitive

– Mainly for aligning very similar sequences or really long sequences

• E.g. whole genome vs whole genome

• Make it more sensitive

– PSI-BLAST: iteratively add more homologous sequences

– PatternHunter: discontinuous seeds

Things we’ve covered so far

• Global alignment

– Needleman-Wunsch and variants

– Improvement on space and time

• Local Alignment

– Smith-Waterman

• Heuristic algorithms

– BLAST families

• Statistics for sequence alignment

– Extreme value distribution

Commonality

• They all deal with aligning two sequences

– Pair-wise sequence alignment

Today

• Aligning multiple sequences all together

– Multiple sequence alignment

Motivation

• A faint similarity between two sequences becomes very significant if present in many

• Protein domains

• Motifs responsible for gene regulation

Definition

• Given N sequences x1, x2, …, xN:

– Insert gaps (-) in each sequence xi, such that

• All sequences have the same length L

• Score of the global map is maximum

• Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences

• Same for a multiple alignment!

Scoring Function

• Ideally:

– Find the alignment that maximizes the probability that the sequences evolved from a common ancestor

[Figure: phylogenetic tree (or evolution tree) relating sequences x, y, z, w, v; the ancestor is unknown (?)]

Scoring Function (cont’d)

• Unfortunately: too many parameters

• Compromises:

– Ignore phylogenetic tree

• Compute from pair-wise scores

– Based on sum of all pair-wise scores

– Based on scores with a consensus sequence

First assumption

• Columns are independent

– Similar in pair-wise alignment

• Therefore, the score of an alignment is the sum of all columns

• Need to decide how to score a single column

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces (columns in which both sequences have a gap are dropped):

x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
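Inducing a pairwise alignment amounts to taking two rows of the multiple alignment and dropping every column where both rows have a gap. A minimal sketch, with a made-up function name:

```python
def induced_pairwise(row_a, row_b):
    """Project a multiple alignment onto two of its rows, dropping gap-gap columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))  # ('ACGCGG-C', 'ACGC-GAG')
print(induced_pairwise(x, z))  # ('AC-GCGG-C', 'GCCGC-GAG')
print(induced_pairwise(y, z))  # ('AC-GCGAG', 'GCCGCGAG')
```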

Sum Of Pairs (cont’d)

• The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments

$S(m) = \sum_{k<l} s(m^k, m^l)$

$s(m^k, m^l)$: score of induced alignment (k, l)

Example:

x: AC-GCGG-C

y: AC-GC-GAG

z: GCCGC-GAG

A C G T -

A 1 -1 -1 -1 -1

C -1 1 -1 -1 -1

G -1 -1 1 -1 -1

T -1 -1 -1 1 -1

- -1 -1 -1 -1 0

Column 1: (A,A) + (A,G) x 2 = -1

Column 2: (C,C) x 3 = 3

Column 8: (-,A) x 2 + (A,A) = -1

Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5
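The column-by-column arithmetic can be checked with a short script. This sketch assumes the score matrix from the slide (match 1, mismatch -1, letter against gap -1, gap against gap 0); the function names are made up for this note.

```python
from itertools import combinations


def pair_score(a, b):
    """Score matrix from the example: match 1, mismatch -1, letter/gap -1, gap/gap 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def sum_of_pairs(rows):
    """Sum pair_score over all pairs of rows in every column of the alignment."""
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


alignment = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(sum_of_pairs(alignment))  # 5
```

Summing over columns first and pairs second gives the same total as summing the three induced pairwise alignment scores, because gap-gap columns contribute 0.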

Sum Of Pairs (cont’d)

• Drawback: no evolutionary characterization

– Every sequence derived from all others

• Heuristic way to incorporate evolution tree

– Weighted Sum of Pairs:

$S(m) = \sum_{k<l} w_{kl} \, s(m^k, m^l)$

$w_{kl}$: weight decreasing with distance

[Figure: evolution tree with leaves Human, Mouse, Chicken, Duck]

Consensus score

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC

Consensus sequence:

CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Find optimal consensus string m* to maximize

$S(m) = \sum_i s(m^*, m^i)$

$s(m^*, m^i)$: score of the pairwise alignment between the consensus m* and sequence i

Multiple Sequence Alignments

Algorithms

Multidimensional Dynamic Programming (MDP)

Generalization of Needleman-Wunsch:

• Find the longest path in a high-dimensional cube

– As opposed to a two-dimensional grid

• Uses an N-dimensional matrix

– As opposed to a two-dimensional array

• Entry F(i1, …, iN) represents the score of the optimal alignment of x1[1..i1], …, xN[1..iN]

$F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{current}) \big)$, where the maximum is taken over all preceding neighbors of the cell

• Example: in 3D (three sequences x, y, z):

• 2^3 – 1 = 7 neighbors/cell

$$
F(i,j,k) = \max
\begin{cases}
F(i-1,j-1,k-1) + S(x_i, y_j, z_k) \\
F(i-1,j-1,k) + S(x_i, y_j, -) \\
F(i-1,j,k-1) + S(x_i, -, z_k) \\
F(i,j-1,k-1) + S(-, y_j, z_k) \\
F(i-1,j,k) + S(x_i, -, -) \\
F(i,j-1,k) + S(-, y_j, -) \\
F(i,j,k-1) + S(-, -, z_k)
\end{cases}
$$
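The seven-neighbor recurrence can be written compactly by enumerating all non-zero 0/1 move vectors. The sketch below computes only the optimal score, and it assumes the column score S is the sum-of-pairs score with match 1, mismatch -1, letter against gap -1, gap against gap 0. The function names are made up for this note.

```python
from itertools import combinations, product


def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def column_score(column):
    """Sum-of-pairs score of a single alignment column (assumed form of S)."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def align3_score(x, y, z):
    """Optimal sum-of-pairs score for aligning three sequences by 3D dynamic programming."""
    seqs = (x, y, z)
    # The 2^3 - 1 = 7 moves: each sequence either consumes a letter (1) or takes a gap (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    # product() visits cells in lexicographic order, so every predecessor is already filled.
    for cell in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if cell == (0, 0, 0):
            continue
        best = None
        for move in moves:
            prev = tuple(c - d for c, d in zip(cell, move))
            if min(prev) < 0:
                continue  # this neighbor lies outside the cube
            column = [s[c - 1] if d else "-" for s, c, d in zip(seqs, cell, move)]
            candidate = F[prev] + column_score(column)
            if best is None or candidate > best:
                best = candidate
        F[cell] = best
    return F[(len(x), len(y), len(z))]


print(align3_score("AC", "AC", "AC"))  # 6
```

The dictionary holds one entry per cell of the cube, which makes the memory and time cost of the approach easy to see even for short inputs.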


[Figure: cell (i,j,k) of the 3D lattice and its seven preceding neighbors: (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)]


Running Time:

1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences

2. Neighbors/cell: $2^N - 1$

Therefore: $O(2^N L^N)$

Faster MDP

• Carrillo & Lipman, 1988

– Branch and bound

– Other heuristics

• Practical for about 6 sequences of length about 200-300.

Progressive Alignment

• Multiple Alignment is NP-hard

• Most used heuristic: Progressive Alignment

Algorithm:

1. Align two of the sequences xi, xj

2. Fix that alignment

3. Align a third sequence xk to the alignment xi,xj

4. Repeat until all sequences are aligned

Running Time: $O(N L^2)$

– Each alignment takes $O(L^2)$

– Repeat N times

Progressive Alignment

• When evolutionary tree is known:

– Align closest first, in the order of the tree

Example: order of alignments:

1. (x, y)

2. (z, w)

3. (xy, zw)

[Figure: evolutionary tree with leaves x, y, z, w]

Progressive Alignment: CLUSTALW

CLUSTALW: most popular multiple protein alignment tool

Algorithm:

1. Find all dij: alignment distance between xi and xj

• High alignment score => short distance

2. Construct a tree

(Neighbor-joining hierarchical clustering. Will discuss in future)

3. Align nodes in order of decreasing similarity

+ a large number of heuristics

CLUSTALW example

• S1 ALSK

• S2 TNSD

• S3 NASK

• S4 NTSD

Distance matrix:

      s1   s2   s3   s4
s1    0    9    4    7
s2         0    8    3
s3              0    7
s4                   0
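To see how a merge order falls out of such a distance matrix, here is a small sketch. It does not implement neighbor-joining; as a simpler stand-in it uses average-linkage clustering, repeatedly merging the two closest clusters. The function name `guide_order` and the dictionary layout are assumptions made for this note.

```python
from itertools import combinations


def guide_order(names, dist):
    """Return the list of cluster merges, closest pair first (average linkage)."""
    clusters = [(name,) for name in names]
    merges = []

    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


# Distances from the example matrix.
dist = {
    frozenset(("s1", "s2")): 9, frozenset(("s1", "s3")): 4, frozenset(("s1", "s4")): 7,
    frozenset(("s2", "s3")): 8, frozenset(("s2", "s4")): 3, frozenset(("s3", "s4")): 7,
}
for merge in guide_order(["s1", "s2", "s3", "s4"], dist):
    print(merge)
```

On this matrix the stand-in pairs s2 with s4 (distance 3) and s1 with s3 (distance 4) before joining the two pairs.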

Guide tree:

[Figure: tree in which s1 is joined with s3, and s2 is joined with s4]

Align s1 with s3:

-ALSK
NA-SK

Align s2 with s4:

-TNSD
NT-SD

Align the two alignments:

-ALSK
-TNSD
NA-SK
NT-SD

Question: how do you align two alignments?

Aligning two alignments

• You can treat each column in an alignment as a single letter

– Remember in the case of gene finder, we aligned three nucleic acids at a time

• How do we score it?

– Naïve: compute Sum of Pairs

• Better: only compute the cross terms

– We already have (K, K) and (D, D)

– Need to add 4 x (K, D): each of the two K's pairs with each of the two D's

-ALSK
NA-SK

-TNSD
NT-SD
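Treating each column as a single letter leads to an ordinary two-dimensional dynamic program over columns. The sketch below scores a pair of columns by their cross terms only, and it assumes a simple scheme (match 1, mismatch -1, letter against gap -1, gap against gap 0) rather than a real protein substitution matrix. The function names are made up for this note.

```python
def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def cross_score(col_a, col_b):
    """Cross terms only: every letter of one column against every letter of the other."""
    return sum(pair_score(a, b) for a in col_a for b in col_b)


def align_alignments(A, B):
    """Globally align two alignments (lists of equal-length rows), column against column."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    gap_a, gap_b = ("-",) * len(A), ("-",) * len(B)
    n, m = len(cols_a), len(cols_b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i and j:
                options.append((F[i - 1][j - 1] + cross_score(cols_a[i - 1], cols_b[j - 1]), "D"))
            if i:
                options.append((F[i - 1][j] + cross_score(cols_a[i - 1], gap_b), "U"))
            if j:
                options.append((F[i][j - 1] + cross_score(gap_a, cols_b[j - 1]), "L"))
            F[i][j], back[i][j] = max(options)
    # Trace back from the corner, emitting one merged column per step.
    merged, i, j = [], n, m
    while i or j:
        move = back[i][j]
        col_a = cols_a[i - 1] if move in "DU" else gap_a
        col_b = cols_b[j - 1] if move in "DL" else gap_b
        merged.append(col_a + col_b)
        if move in "DU":
            i -= 1
        if move in "DL":
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]


for row in align_alignments(["-ALSK", "NA-SK"], ["-TNSD", "NT-SD"]):
    print(row)
```

The rows of the first alignment come out first, followed by the rows of the second; the gaps already inside each input alignment are never changed, only whole gap columns are added.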

CLUSTALW & the CINEMA viewer

Iterative Refinement

Problems with progressive alignment:

• Depends on pair-wise alignments

• If sequences are very distantly related, much higher likelihood of errors

• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT   Frozen!

z: GAACTG
w: GTACTG

Now clear: correct y should be GA-CTT


Algorithm (Barton-Sternberg):

1. Align most similar xi, xj

2. Align xk most similar to (xi xj)

3. Repeat 2 until (x1 … xN) are aligned

4. For j = 1 to N,

Remove xj, and realign it to x1 … x(j-1) x(j+1) … xN

5. Repeat 4 until convergence

Note: Guaranteed to converge

Running time: $O(k N L^2)$, k: number of iterations

Iterative Refinement (cont’d)

For each sequence y:

1. Remove y

2. Realign y (while rest fixed)

[Figure: 3D lattice with axes x, y, z; the x,z projection is fixed while y is allowed to vary]


Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:

x: GAAGTTA
y: G-ACTTA   + 3 matches
z: GAACTGA
w: GTACTGA


• Example not handled well:

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z:  GAACTGA
w:  GTACTGA

Realigning any single yi changes nothing

Restricted MDP

• Similar to bounded DP in pair-wise alignment

1. Construct progressive multiple alignment m

2. Run MDP, restricted to radius R from m

Running Time: $O(2^N R^{N-1} L)$

[Figure: 3D lattice with axes x, y, z showing the region within radius R of m]

Restricted MDP

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z:  GAACTGA
w:  GTACTGA

• Within radius 1 of the optimal

Restricted MDP will fix it.

Other approaches

• Profile Hidden Markov Models

– Statistical learning methods

– Will discuss in future

Multiple alignment tools

• Clustal W (Thompson, 1994)

– Most popular

• PRRP (Gotoh, 1993)

• HMMT (Eddy, 1995)

• DIALIGN (Morgenstern, 1998)

• T-Coffee (Notredame, 2000)

• MUSCLE (Edgar, 2004)

• Align-m (Walle, 2004)

• PROBCONS (Do, 2004)

In summary

• Multiple alignment algorithms:

– MDP (too slow)

• B&B doesn’t solve the problem entirely

– Progressive alignment: CLUSTALW

– Iterative refinement

– Restricted MDP
