
04/21/23 Burkhard Morgenstern, Tunis 2007

Multiple Alignment and Motif Searching

Burkhard Morgenstern

Universität Göttingen

Institute of Microbiology and Genetics

Department of Bioinformatics

Tunis, March 2007

Multiple Alignment and Motif Searching

http://www.gobics.de/burkhard/teaching/tunis_07.php

Information flow in the cell

Sequence -> Structure -> Function

gap between sequence and structure/function data

Lots of data available at the sequence level

Fewer data at the structure and function level

Exponential growth of databases

Major goal of bioinformatics: close the gap between sequence information and structure/function information

Most important tool for sequence analysis: sequence comparison

Simple approach: dot plot, more advanced approach: sequence alignment

The dot plot

Gibbs and McIntyre (1970)

The dot plot

[Figure, built up over four slides: comparison matrix with Y Q E W T Y I V A R E A Q Y E along the top and C I V M R E Q Y down the side]

Two sequences to be compared

Comparison matrix

Search pairs of identical residues

Dot plot: dot (X) for all pairs of identical residues
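The construction above can be sketched in a few lines of Python. This is only an illustration: the function name `dot_plot` and the use of `.` for empty cells are assumptions, not part of the slides.

```python
def dot_plot(top, side):
    """Return one string per residue of `side`; 'X' marks identical residues."""
    rows = []
    for b in side:
        # Compare residue b of the side sequence with every residue of the top sequence.
        rows.append("".join("X" if a == b else "." for a in top))
    return rows


top = "YQEWTYIVAREAQYE"   # sequence written along the top of the matrix
side = "CIVMREQY"         # sequence written down the side of the matrix
plot = dot_plot(top, side)

# Print the matrix with the sequences as row and column labels.
print("  " + top)
for residue, row in zip(side, plot):
    print(residue + " " + row)
```

Each row of the printed matrix belongs to one residue of the second sequence; runs of X along a diagonal indicate matching segments.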

The dot plot

Homologies as diagonal lines from top-left to bottom-right corner

The dot plot

Inversions as diagonals from bottom left to top right

The dot plot

[Figure: dot plot of Y Q E W T Y Q E V R E Y Q E I (top) against C I V M R Y Q E (side)]

Repeats as parallel diagonals

The dot plot

Advantages:

1. Various types of similarity detectable (repeats, inversions)

2. Useful for large-scale analysis

Use filtering for long sequences: dots represent matching segments instead of matching single residues
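One possible way to filter is to place a dot only where a whole segment of k residues is identical in both sequences. The sketch below assumes this exact-match criterion; the function name `filtered_dot_plot` and the segment length `k = 3` are illustrative choices, and real tools may use other filters.

```python
def filtered_dot_plot(top, side, k=3):
    """Return start positions (i, j) where k consecutive residues are identical."""
    hits = []
    for j in range(len(side) - k + 1):
        for i in range(len(top) - k + 1):
            # A dot now stands for a matching segment, not a single residue.
            if top[i:i + k] == side[j:j + k]:
                hits.append((i, j))
    return hits


hits = filtered_dot_plot("YQEWTYQEVREYQEI", "CIVMRYQE", k=3)
print(hits)
```

In this example only the repeated segment Y Q E survives the filter, so the isolated single-residue dots disappear and the repeat stands out.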


Pair-wise sequence alignment

Evolutionary or structurally related sequences:

alignment possible

Sequence homologies represented by inserting gaps

[Figure, built up over several slides: comparison matrix with T Y I V A R E Q Y E along the top and C I V M R E A Q Y down the side]

Two input sequences

Comparison matrix for two sequences

Dot plot for two sequences

Similarities in same relative order over entire sequences

Global alignment of sequences possible

Alignment corresponds to path through comparison matrix

Matches (red), mismatches (green), gaps (blue)

(global) alignment: write sequences on top of each other, gaps represented by dash symbols

T Y I V A R E Q Y E

C I V M R E A Q Y

Input sequences

T Y I V A R E - Q Y E
C - I V M R E A Q Y -

Alignment of input sequences

Alignment consists of matches (red), mismatches (green) and gaps (blue)

Basic task:

Find ‘best’ alignment of two sequences

= alignment that reflects structural and evolutionary relations

Questions:

1. What is a good alignment?

2. How to find the best alignment?

Idea: consider alignment as hypothesis about evolution of sequences.

gaps correspond to insertions/deletions

mismatches correspond to substitutions

C - I V M R E A - Q Y

Problem: astronomical number of possible alignments

T Y I V A R E Q Y E

C I - V M R E A Q Y

- C I V M R E A Q Y -

Stupid computer has to find out: which alignment is best?
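To get a feeling for how fast the number of alignments grows, one can count the paths through the comparison matrix. The sketch below assumes that every path built from diagonal, horizontal and vertical steps counts as a distinct alignment; the function name `count_alignments` is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Number of paths through the comparison matrix of sequences of length n and m."""
    if n == 0 or m == 0:
        # Only one way left: align the remaining residues with gaps.
        return 1
    # Last column is a (mis)match, a gap in the second sequence, or a gap in the first.
    return (count_alignments(n - 1, m - 1)
            + count_alignments(n - 1, m)
            + count_alignments(n, m - 1))


for length in (1, 2, 3, 5, 10):
    print(length, count_alignments(length, length))
```

Under this counting the number grows exponentially with sequence length, which is why trying all alignments one by one is not an option.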

First (simplified) rules:

1. minimize number of mismatches

2. maximize number of matches

General assumption: sequences not too distantly related.

In this case: mismatches (substitutions) and gaps (insertions/deletions) unlikely

Consequence: good alignment should reduce gaps and mismatches

C I - V M R E A Q Y -

Second (simplified) rule:

minimize number of gaps

T Y I V - A R E - Q Y E
C I - V M - R E A Q Y -


Parsimony principle: minimize number of evolutionary events

For protein sequences: different degrees of similarity among amino acids. Counting matches/mismatches is oversimplistic.

T Y I V
T L V

Protein sequences to be aligned

T Y I V
T L - V

Possible alignment

T Y I V
T - L V

Alternative alignment

Some amino acid residues are more similar to each other than others

Therefore: similarity among amino acid residues has to be taken into account.

To assess quality of protein alignments:

use similarity scores for amino acids

s(a,b) similarity score for amino acids a and b

Similarity measured by substitution matrices based on substitution probabilities

Important substitution matrices:

PAM (M. Dayhoff)

BLOSUM (S. Henikoff / J. Henikoff)

The PAM matrix:

Consider probability $p_{a,b}$ of substitution a → b (or b → a) for amino acids a and b

Define for amino acids a and b similarity score S(a,b) based on probability $p_{a,b}$

First task: find out $p_{a,b}$ for every pair of amino acids a, b

The PAM matrix:

Use closely related protein families: no alignment problem, no double substitutions

Construct phylogenetic tree with parsimony method

Count substitution frequencies/probabilities

Normalize substitution probabilities

Extrapolate probabilities for larger evolutionary distances

Finally: define similarity score

$$S(a,b) = \log \frac{p_{a,b}}{q_a \, q_b}$$

$q_a$ = (relative) frequency of amino acid a
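The log-odds definition can be tried out with made-up numbers. In the sketch below the probabilities and frequencies are invented purely for illustration and are not taken from any real PAM matrix; the natural logarithm is also an assumption, since the slides do not state the base.

```python
import math


def log_odds(p_ab, q_a, q_b):
    """Similarity score log(p_ab / (q_a * q_b)); natural log is an assumption."""
    return math.log(p_ab / (q_a * q_b))


# Invented example values (NOT real PAM data):
q_a, q_b = 0.1, 0.1                       # background frequencies of residues a and b
frequent_pair = log_odds(0.02, q_a, q_b)  # pair seen more often than expected by chance
rare_pair = log_odds(0.005, q_a, q_b)     # pair seen less often than expected by chance

print(frequent_pair, rare_pair)
```

A pair that is substituted more often than expected by chance gets a positive score, a pair that is substituted less often gets a negative score.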

T Y I V
T - L V

Given a similarity score s(a,b) for pairs of amino acids, define quality score of alignment as:

sum of similarity values s(a,b) of aligned residues

minus gap penalty g for each residue aligned with a gap

Example:

Score = s(T,T) + s(I,L) + s(V,V) - g
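The score definition translates directly into code. The sketch below uses a toy similarity function `toy_s` with invented values (2 for identical residues, 1 for the pair I/L, -1 otherwise) and a gap penalty of 2; none of these numbers come from a real substitution matrix.

```python
def alignment_score(row1, row2, s, g):
    """Sum of s(a, b) over aligned residue pairs minus g per residue aligned with a gap."""
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score -= g
        else:
            score += s(a, b)
    return score


def toy_s(a, b):
    """Invented similarity values, for illustration only."""
    if a == b:
        return 2
    if {a, b} == {"I", "L"}:
        return 1   # assume I and L count as similar
    return -1


print(alignment_score("TYIV", "T-LV", toy_s, g=2))   # I aligned with L
print(alignment_score("TYIV", "TL-V", toy_s, g=2))   # Y aligned with L
```

With these toy values the alignment that pairs I with L scores 3 and the one that pairs Y with L scores 1, so the similarity scores decide between two alignments that have the same number of matches and gaps.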

Next question: find alignment with best score

Dynamic-programming algorithm finds alignment with best score.

(Needleman and Wunsch, 1970)

T Y I V A R E A Q Y E
- C I V M R E - Q Y -

[Figure: comparison matrix of the two sequences, with the alignment drawn as a path through the matrix]

T W L V - R E A Q I
- C I V M R E - H Y

Score of alignment: sum of similarity values of aligned residues minus gap penalties

Example: S = - g + s(W,C) + s(L,I) + s(V,V) - g + s(R,R) + …

[Figure, built up over several slides: comparison matrix of the two example sequences, position i along the top and position j down the side]

Alignment corresponds to path through comparison matrix

Dynamic programming: calculate scores S(i,j) of optimal alignment of prefixes up to positions i and j.

T W L V - R
- C I V M R

S(i,j) can be calculated from possible predecessors S(i-1,j-1), S(i,j-1), S(i-1,j).

Score of optimal path that comes from top left = S(i-1,j-1) + s(R,R)

Score of optimal path that comes from above = S(i,j-1) - g

T W L V R -
- C I V M R

Score of optimal path that comes from left = S(i-1,j) - g

T W L - - V R
- C I V M R -

Score of optimal path = maximum of these three values

Recursion formula for global alignment:

For sequences x and y

$$S(i,j) = \max \begin{cases} S(i-1,j-1) + s(x_i,y_j) \\ S(i-1,j) - g \\ S(i,j-1) - g \end{cases}$$

[Figure, animated over several slides: score matrix with T W L V R along the top and C I V M R E H Y down the side, filled in cell by cell]

Fill matrix from top left to bottom right

Find optimal alignment by trace-back procedure

Initial matrix entries?

Entries S(i,j): scores of optimal alignment of prefixes up to positions i and j.

T W L V
- C I V

Entries S(i,0): scores of optimal alignment of prefix up to position i and empty prefix. Score = - i * g

T W L V
- - - -

Initial matrix entries: Example, g = 2

          T    W    L    V    R
     0   -2   -4   -6   -8  -10
C   -2
I   -4
V   -6
M   -8
R  -10
E  -12
H  -14
Y  -16
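The initialisation, matrix fill and trace-back described above can be put together in a compact sketch. It assumes a linear gap penalty g and an invented toy similarity function (2 for identical residues, -1 otherwise); the names `needleman_wunsch` and `toy_s` are illustrative, and this is not meant as a reference implementation.

```python
def toy_s(a, b):
    """Invented similarity values, for illustration only."""
    return 2 if a == b else -1


def needleman_wunsch(x, y, s, g):
    """Global alignment of x and y; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    # S[i][j] = score of the best alignment of the prefixes x[:i] and y[:j].
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * g          # prefix of x aligned with the empty prefix
    for j in range(1, m + 1):
        S[0][j] = -j * g          # prefix of y aligned with the empty prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          S[i - 1][j] - g,
                          S[i][j - 1] - g)
    # Trace-back from the bottom-right corner to the top-left corner.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            a.append(x[i - 1]); b.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - g:
            a.append(x[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(y[j - 1]); j -= 1
    return S[n][m], "".join(reversed(a)), "".join(reversed(b))


score, row1, row2 = needleman_wunsch("TWLVR", "CIVMREHY", toy_s, g=2)
print(score)
print(row1)
print(row2)
```

When several alignments share the best score, this trace-back returns only one of them; which one depends on the order in which the three predecessors are tested.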

Pair-wise global alignment

[Figure: comparison matrix of the two sequences with the alignment path]

T W L V - R E A Q I
- C I V M R E - F Y

Pair-wise global alignment

Computational complexity: how does program run time and memory depend on size of input data?

l1 and l2 lengths of sequences: computing time and memory proportional to l1 * l2

Time and memory complexity = O(l1 * l2)

More realistic gap penalty: affine-linear instead of linear

Penalty for gap of length l:

c0 + (l-1)* c1

c0 = ‘gap-opening penalty’

c1 = ‘gap-extension penalty’
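The affine-linear penalty formula is easy to check numerically. The values c0 = 10 and c1 = 1 in this sketch are invented for illustration, and the function name `affine_gap_penalty` is an assumption.

```python
def affine_gap_penalty(length, c0, c1):
    """Penalty c0 + (length - 1) * c1 for a gap of the given length."""
    if length <= 0:
        return 0              # no gap, no penalty
    return c0 + (length - 1) * c1


# Invented example values: opening a gap is expensive, extending it is cheap.
one_long_gap = affine_gap_penalty(4, c0=10, c1=1)
four_short_gaps = 4 * affine_gap_penalty(1, c0=10, c1=1)
print(one_long_gap, four_short_gaps)
```

With these values one gap of length 4 costs 13, while four separate gaps of length 1 cost 40, so the scheme prefers a few long gaps over many short ones.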

Pair-wise local alignment

So far: global alignment considered: sequences aligned over their entire length.

But: sequences often share only local sequence similarity (conserved genes or domains)

Most important application: database searching

[Figure: comparison matrix of the two example sequences]

Problem:

Find pair of segments with maximal alignment score (not necessarily part of optimal global alignment!)

Equivalent: find path starting and ending anywhere in the matrix.

Recursion formula for global alignment:

$$S(i,j) = \max \begin{cases} S(i-1,j-1) + s(x_i,y_j) \\ S(i-1,j) - g \\ S(i,j-1) - g \end{cases}$$

Recursion formula for local alignment:

$$S(i,j) = \max \begin{cases} 0 \\ S(i-1,j-1) + s(x_i,y_j) \\ S(i-1,j) - g \\ S(i,j-1) - g \end{cases}$$

Initial matrix entries = 0

          T    W    L    V    R
     0    0    0    0    0    0
C    0
I    0
V    0
M    0
R    0
E    0
H    0
Y    0

Example: s(C,T) = -2, so the first entry is max { 0, 0 - 2, 0 - g, 0 - g } = 0

For trace-back:

Store positions imax and jmax with

S(imax ,jmax) maximal

Algorithm by Smith and Waterman (1981)

Implementation: e.g. BestFit in GCG package

Complexity: l1 and l2 lengths of sequences: computing time and memory proportional to l1 * l2

Time and space complexity = O(l1 * l2)
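A minimal sketch of the local-alignment recursion with trace-back from the maximal entry is shown below. It assumes a linear gap penalty and invented toy scores (2 for identical residues, -1 otherwise); the names `smith_waterman` and `toy_s` are illustrative, and this is not the BestFit implementation.

```python
def toy_s(a, b):
    """Invented similarity values, for illustration only."""
    return 2 if a == b else -1


def smith_waterman(x, y, s, g):
    """Local alignment of x and y; returns (score, aligned segment of x, of y)."""
    n, m = len(x), len(y)
    # First row and column stay 0: a local alignment may start anywhere.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, imax, jmax = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          S[i - 1][j] - g,
                          S[i][j - 1] - g)
            if S[i][j] > best:
                best, imax, jmax = S[i][j], i, j   # remember the maximal entry
    # Trace-back from (imax, jmax) until an entry with score 0 is reached.
    a, b = [], []
    i, j = imax, jmax
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            a.append(x[i - 1]); b.append(y[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] - g:
            a.append(x[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(y[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("AAAXYZBBB", "CCCXYZDDD", toy_s, g=2))
```

The two nested loops over all positions i and j are what makes time and memory proportional to l1 * l2.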

Too slow for database searching! Therefore tools like BLAST are necessary.

The Basic Local Alignment Search Tool (BLAST)

New BLAST version (1997)

Two-hit strategy

Gapped BLAST

Position-Specific Iterative BLAST (PSI BLAST)

PSI BLAST:

1. search database with standard BLAST

2. take best hits and create multiple alignment

3. calculate profile from multiple alignment

4. search database again with profile as query

profile for sequence family or motif:

table of amino acid/nucleotide frequencies at any position in alignment.

Profile: frequencies of nucleotides at every position.

seq1    A    T    T    G    -    A    T
seq2    C    T    T    G    T    A    G
seq3    A    -    -    G    T    A    T
seq4    A    T    G    G    T    G    T
seq5    A    C    T    G    T    A    C

A      80    0    0    0    0   80    0
T       0   75   75    0  100    0   60
C      20   25    0    0    0    0   20
G       0    0   25  100    0   20   20
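The profile in the table can be recomputed from the alignment. The sketch below assumes that frequencies are percentages of the non-gap residues in each column, which is what reproduces the numbers above; the function name `profile` is illustrative.

```python
from collections import Counter


def profile(alignment, alphabet="ATCG"):
    """Percentage of each nucleotide per alignment column, ignoring gaps."""
    table = {c: [] for c in alphabet}
    for column in zip(*alignment):
        counts = Counter(r for r in column if r != "-")
        total = sum(counts.values())
        for c in alphabet:
            # Columns that contain only gaps get frequency 0 for every nucleotide.
            table[c].append(round(100 * counts[c] / total) if total else 0)
    return table


alignment = ["ATTG-AT",
             "CTTGTAG",
             "A--GTAT",
             "ATGGTGT",
             "ACTGTAC"]
prof = profile(alignment)
for nucleotide in "ATCG":
    print(nucleotide, prof[nucleotide])
```

The same function works for protein alignments if the `alphabet` argument is replaced by the 20 amino acid letters.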

Tools for multiple sequence alignment

s1 T Y I M R E A Q Y E S A Q

s2 T C I V M R E A Y E

s3 Y I M Q E V Q Q E R

s4 W R Y I A M R E Q Y E

s1 - T Y I - M R E A Q Y E S A Q

s2 - T C I V M R E A - Y E - - -

s3 - - Y I - M Q E V Q Q E R - -