04/21/23 Burkhard Morgenstern, Tunis 2007

Multiple Alignment and Motif Searching

Burkhard Morgenstern

Universität Göttingen

Institute of Microbiology and Genetics

Department of Bioinformatics

Tunis, March 2007

http://www.gobics.de/burkhard/teaching/tunis_07.php

Information flow in the cell

Idea:

Sequence -> Structure -> Function

gap between sequence and structure/function data

Lots of data available at the sequence level

Fewer data at the structure and function level

Exponential growth of databases

Major goal of bioinformatics: close the gap between sequence information and structure/function information

Most important tool for sequence analysis: sequence comparison

Simple approach: dot plot, more advanced approach: sequence alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

The dot plot


Gibbs and McIntyre (1970)

Two sequences to be compared:

Y Q E W T Y I V A R E A Q Y E

C I V M R E Q Y

Comparison matrix: first sequence along the horizontal axis, second sequence along the vertical axis

Search pairs of identical residues

Dot plot: dot (X) for all pairs of identical residues

[Figure: dot plot of the two sequences]

Homologies as diagonal lines from top-left to bottom-right corner

Inversions as diagonals from bottom left to top right

Repeats as parallel diagonals

[Figure: dot plot of Y Q E W T Y Q E V R E Y Q E I against C I V M R Y Q E, showing parallel diagonals]

Advantages:

1. Various types of similarity detectable (repeats, inversions)

2. Useful for large-scale analysis

Use filtering for long sequences: dots represent matching segments instead of matching single residues
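The dot plot and the filtering idea can be sketched in a few lines of Python. This is only an illustration: the function names `dot_plot` and `print_dot_plot` and the `word` parameter are assumptions made for this sketch, not part of any particular dot-plot program. With `word=1` every pair of identical residues gets a dot; with `word=2` a dot is drawn only where two-residue segments match.

```python
def dot_plot(seq_x, seq_y, word=1):
    """Return one string per residue of seq_y; columns follow seq_x.

    A cell is 'X' when the segments of length `word` starting at that
    position are identical in both sequences, otherwise '.'.
    """
    rows = []
    for j in range(len(seq_y)):
        row = []
        for i in range(len(seq_x)):
            a = seq_x[i:i + word]
            b = seq_y[j:j + word]
            # Segments cut short by the sequence end never count as a match.
            row.append("X" if len(a) == word and a == b else ".")
        rows.append("".join(row))
    return rows


def print_dot_plot(seq_x, seq_y, word=1):
    """Print the matrix with seq_x as header and seq_y as row labels."""
    print("  " + seq_x)
    for residue, row in zip(seq_y, dot_plot(seq_x, seq_y, word)):
        print(residue + " " + row)


if __name__ == "__main__":
    print_dot_plot("YQEWTYIVAREAQYE", "CIVMREQY")          # single residues
    print()
    print_dot_plot("YQEWTYIVAREAQYE", "CIVMREQY", word=2)  # filtered
```

Filtering with a longer word removes most isolated dots and leaves the diagonal runs, which is why it helps for long sequences.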


Pair-wise sequence alignment

Evolutionarily or structurally related sequences: alignment possible

Sequence homologies represented by inserting gaps

Two input sequences:

T Y I V A R E Q Y E

C I V M R E A Q Y

[Figure: comparison matrix and dot plot for the two sequences]

Similarities in same relative order over entire sequences

Global alignment of sequences possible

Alignment corresponds to path through comparison matrix

[Figure: alignment path through the comparison matrix; matches (red), mismatches (green), gaps (blue)]

(global) alignment: write sequences on top of each other, gaps represented by dash symbols

04/21/23 Burkhard Morgenstern, Tunis 2007

Pair-wise sequence alignment

T Y I V A R E Q Y E

C I V M R E A Q Y

Input sequences

T Y I V A R E - Q Y E

C - I V M R E A Q Y -

Alignment of input sequences

Alignment consists of matches (red), mismatches (green) and gaps (blue)

Basic task:

Find ‘best’ alignment of two sequences

= alignment that reflects structural and evolutionary relations


Questions:

1. What is a good alignment?

2. How to find the best alignment?

Idea: consider alignment as hypothesis about evolution of sequences.

Gaps correspond to insertions/deletions

Mismatches correspond to substitutions

Problem: astronomical number of possible alignments, for example:

T Y I V A R E - Q Y E
C - I V M R E A - Q Y

T Y I V A R E Q Y E
C I - V M R E A Q Y

T Y I V A R E - Q Y E
- C I V M R E A Q Y -

stupid computer has to find out: which alignment is best?
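To get a feeling for "astronomical", the number of alignments can be counted with a small recursion. The sketch below makes one assumption that is not stated on the slides: an alignment is counted as a sequence of columns, where each column is a residue pair, a gap in the first sequence, or a gap in the second sequence. Under that convention the last column can be one of three types, which gives D(i,j) = D(i-1,j-1) + D(i-1,j) + D(i,j-1). The function name `count_alignments` is chosen for this illustration.

```python
def count_alignments(m, n):
    """Count global alignments of two sequences of lengths m and n.

    Columns are residue/residue, residue/gap or gap/residue, so
    D(i, j) = D(i-1, j-1) + D(i-1, j) + D(i, j-1), with D(i, 0) = D(0, j) = 1
    (only one way to align a prefix with the empty sequence).
    """
    d = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = d[i - 1][j - 1] + d[i - 1][j] + d[i][j - 1]
    return d[m][n]


if __name__ == "__main__":
    for length in (1, 2, 3, 10, 50, 100):
        print(length, count_alignments(length, length))
```

The count grows exponentially with sequence length, so listing all alignments and scoring each one is not an option even for short proteins.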


First (simplified) rules:

1. minimize number of mismatches

2. maximize number of matches


General assumption: sequences not too distantly related.

In this case: mismatches (substitutions) and gaps (insertions/deletions) unlikely

Consequence: good alignment should reduce gaps and mismatches


Second (simplified) rule:

minimize number of gaps

T Y I V - A R E - Q Y E

C I - V M - R E A Q Y -

Parsimony principle: minimize number of evolutionary events
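The simplified rules can be checked mechanically by counting the three kinds of columns in a given alignment. A minimal sketch, assuming both rows are given as equal-length strings with `-` as gap symbol; the function name `count_events` is hypothetical.

```python
def count_events(row_x, row_y, gap="-"):
    """Return (matches, mismatches, gaps) for two aligned rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            gaps += 1          # insertion or deletion
        elif a == b:
            matches += 1
        else:
            mismatches += 1    # substitution
    return matches, mismatches, gaps


if __name__ == "__main__":
    print(count_events("TYIVARE-QYE", "C-IVMREAQY-"))
    print(count_events("TYIV-ARE-QYE", "CI-VM-REAQY-"))
```

Under the parsimony principle, the alignment with fewer mismatches and gaps (fewer assumed evolutionary events) is preferred.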

04/21/23 Burkhard Morgenstern, Tunis 2007

Pair-wise sequence alignment

For protein sequences: different degrees of similarity among amino acids. Counting matches/mismatches is oversimplistic.

04/21/23 Burkhard Morgenstern, Tunis 2007

Pair-wise sequence alignment

T Y I V

T L V

Protein sequences to be aligned

04/21/23 Burkhard Morgenstern, Tunis 2007

Pair-wise sequence alignment

T Y I V

T L - V

Possible alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

Pair-wise sequence alignment

T Y I V

T - L V

Alternative alignment


Some amino acid residues are more similar to each other than others

Therefore: similarity among amino acid residues has to be taken into account.


To assess quality of protein alignments:

use similarity scores for amino acids

s(a,b) similarity score for amino acids a and b

04/21/23 Burkhard Morgenstern, Tunis 2007

Pair-wise sequence alignment

Similarity measured by substitution matrices based on substitution probabilities

Important substitution matrices:

PAM (M. Dayhoff) BLOSUM (S. Henikoff / J. Henikoff)

The PAM matrix:

Consider probability p_{a,b} of substitution a → b (or b → a) for amino acids a and b

Define for amino acids a and b similarity score s(a,b) based on probability p_{a,b}

First task: find out p_{a,b} for every pair of amino acids a, b

Use closely related protein families: no alignment problem, no double substitutions

Construct phylogenetic tree with parsimony method

Count substitution frequencies/probabilities

Normalize substitution probabilities

Extrapolate probabilities for larger evolutionary distances

Finally: define similarity score

s(a,b) = log( p_{a,b} / (q_a q_b) )

q_a = (relative) frequency of amino acid a
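The log-odds definition is easy to try out numerically. In the sketch below the probabilities are made-up numbers chosen only to show the sign of the score; they are not values from the PAM or BLOSUM matrices, and the function name `log_odds_score` is an assumption of this example.

```python
import math


def log_odds_score(p_ab, q_a, q_b):
    """Similarity score log(p_ab / (q_a * q_b)).

    p_ab: probability of seeing a and b aligned in related sequences
    q_a, q_b: background frequencies of the two amino acids
    """
    return math.log(p_ab / (q_a * q_b))


if __name__ == "__main__":
    # Hypothetical numbers: the pair is seen twice as often as expected by chance.
    print(log_odds_score(0.02, 0.1, 0.1))    # positive score
    # Hypothetical numbers: the pair is seen half as often as expected by chance.
    print(log_odds_score(0.005, 0.1, 0.1))   # negative score
```

A pair that occurs more often in related sequences than chance would predict gets a positive score; a pair that occurs less often gets a negative one.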


Given a similarity score s(a,b) for pairs of amino acids, define quality score of alignment as:

sum of similarity values s(a,b) of aligned residues

minus gap penalty g for each residue aligned with a gap

Example:

T Y I V

T - L V

Score = s(T,T) + s(I,L) + s(V,V) - g
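The quality score of a given alignment can be computed column by column. This sketch assumes a linear gap penalty g per gapped column, as in the example; the similarity function `toy_similarity` uses invented values (identical residues 2, the pair I/L 1, everything else -1) purely for illustration and is not a real substitution matrix.

```python
def alignment_score(row_x, row_y, s, g, gap="-"):
    """Sum of s(a, b) over aligned residue pairs minus g per gapped column."""
    score = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            score -= g
        else:
            score += s(a, b)
    return score


def toy_similarity(a, b):
    """Hypothetical similarity values, for illustration only."""
    if a == b:
        return 2
    if {a, b} == {"I", "L"}:
        return 1
    return -1


if __name__ == "__main__":
    # s(T,T) + s(I,L) + s(V,V) - g with the toy values and g = 2
    print(alignment_score("TYIV", "T-LV", toy_similarity, g=2))
    # The alternative alignment pairs Y with L and leaves I unaligned
    print(alignment_score("TYIV", "TL-V", toy_similarity, g=2))
```

With these toy values the alignment that pairs I with L scores higher than the one that pairs Y with L, which is the point of using similarity scores instead of plain match counting.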


Next question: find alignment with best score

Dynamic-programming algorithm finds alignment with best score.

(Needleman and Wunsch, 1970)

T Y I V A R E A Q Y E

- C I V M R E - Q Y -

Alignment corresponds to path through comparison matrix

[Figure: path through the comparison matrix for the alignment above]

Score of alignment: Sum of similarity values of aligned residues minus gap penalty

T W L V - R E A Q I

- C I V M R E - H Y

Example: S = - g + s(W,C) + s(L,I) + s(V,V) - g + s(R,R) …

Dynamic programming: Calculate scores S(i,j) of optimal alignment of prefixes up to positions i and j.

[Figure: comparison matrix with index i along the horizontal sequence and index j along the vertical sequence]

S(i,j) can be calculated from possible predecessors S(i-1,j-1), S(i,j-1), S(i-1,j).

Score of optimal path that comes from top left = S(i-1,j-1) + s(R,R)

T W L V - R
- C I V M R

Score of optimal path that comes from above = S(i,j-1) - g

T W L V R -
- C I V M R

Score of optimal path that comes from left = S(i-1,j) - g

T W L - - V R
- C I V M R -

Score of optimal path = Maximum of these three values

Recursion formula for global alignment:

For sequences x and y

\[
S(i,j) = \max \begin{cases}
S(i-1,j-1) + s(x_i, y_j) \\
S(i-1,j) - g \\
S(i,j-1) - g
\end{cases}
\]

[Figure: matrix for sequences T W L V R (horizontal) and C I V M R E H Y (vertical), filled step by step]

Fill matrix from top left to bottom right

Find optimal alignment by trace-back procedure

Initial matrix entries?

Entries S(i,j): scores of optimal alignment of prefixes up to positions i and j.

T W L V
- C I V

Entries S(i,0): scores of optimal alignment of prefix up to position i and empty prefix.

T W L V
- - - -

Score = - i * g

Initial matrix entries: Example, g = 2

          T    W    L    V    R
     0   -2   -4   -6   -8  -10
C   -2
I   -4
V   -6
M   -8
R  -10
E  -12
H  -14
Y  -16
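Initialisation, matrix filling and trace-back for global alignment can be put together in a short Python sketch. It follows the recursion over S(i-1,j-1), S(i-1,j) and S(i,j-1) with a linear gap penalty g. The function names and the similarity function `simple_score` (match +1, mismatch -1) are assumptions made for this illustration; a real program would use a substitution matrix such as PAM or BLOSUM.

```python
def simple_score(a, b):
    """Hypothetical similarity: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


def needleman_wunsch(x, y, s, g):
    """Global alignment with linear gap penalty g.

    Returns (score, row_x, row_y) for one optimal alignment.
    """
    n, m = len(x), len(y)
    # S[i][j]: best score for aligning the prefixes x[:i] and y[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * g              # prefix of x against empty prefix
    for j in range(1, m + 1):
        S[0][j] = -j * g              # empty prefix against prefix of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # pair x_i with y_j
                S[i - 1][j] - g,                          # x_i against a gap
                S[i][j - 1] - g,                          # y_j against a gap
            )
    # Trace-back from the last cell to the first cell
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] - g:
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))


if __name__ == "__main__":
    score, row_x, row_y = needleman_wunsch("TWLVR", "CIVMREHY", simple_score, 2)
    print(score)
    print(row_x)
    print(row_y)
```

When several alignments reach the same optimal score, this trace-back reports only one of them, preferring the diagonal step.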


Pair-wise global alignment

Computational complexity: how does program run time and memory depend on size of input data?

l1 and l2 lengths of sequences: computing time and memory proportional to l1 * l2

Time and memory complexity = O(l1 * l2)

04/21/23 Burkhard Morgenstern, Tunis 2007

Pair-wise sequence alignment

More realistic gap penalty: affine-linear instead of linear

Penalty for gap of length l:

c0 + (l-1)* c1

c0 = ‘gap-opening penalty’

c1 = ‘gap-extension penalty’
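The affine-linear penalty is a one-line function. In this sketch c0 is taken to be the gap-opening penalty and c1 the gap-extension penalty; the function name and the numeric values in the usage lines are illustrative assumptions.

```python
def affine_gap_penalty(length, c0, c1):
    """Penalty c0 + (length - 1) * c1 for a gap of the given length."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return c0 + (length - 1) * c1


if __name__ == "__main__":
    # Hypothetical values: opening a gap is expensive, extending it is cheap.
    for length in (1, 2, 5):
        print(length, affine_gap_penalty(length, c0=10, c1=1))
```

With c1 smaller than c0, one long gap costs less than several short gaps of the same total length, which is why this penalty is considered more realistic than a linear one.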

04/21/23 Burkhard Morgenstern, Tunis 2007

Pair-wise local alignment

So far: global alignment considered: sequences aligned over their entire length.

But: sequences often share only local sequence similarity (conserved genes or domains)

Most important application: database searching


Problem:

Find pair of segments with maximal alignment score (not necessarily part of optimal global alignment!)

Equivalent: find path starting and ending anywhere in the matrix.


Recursion formula for global alignment:

S(i,j) = max { S(i-1,j-1) + s(x_i,y_j) , S(i-1,j) - g , S(i,j-1) - g }

Recursion formula for local alignment:

S(i,j) = max { 0 , S(i-1,j-1) + s(x_i,y_j) , S(i-1,j) - g , S(i,j-1) - g }

Initial matrix entries = 0

          T    W    L    V    R
     0    0    0    0    0    0
C    0    0
I    0
V    0
M    0
R    0
E    0
H    0
Y    0

s(C,T) = -2, so the entry for (T,C) is set to 0 instead of a negative value

Recursion formula for global alignment:

\[
S(i,j) = \max \begin{cases}
S(i-1,j-1) + s(x_i, y_j) \\
S(i-1,j) - g \\
S(i,j-1) - g
\end{cases}
\]

Recursion formula for local alignment:

\[
S(i,j) = \max \begin{cases}
0 \\
S(i-1,j-1) + s(x_i, y_j) \\
S(i-1,j) - g \\
S(i,j-1) - g
\end{cases}
\]

04/21/23 Burkhard Morgenstern, Tunis 2007

Pair-wise sequence alignment

For trace-back:

Store positions i_max and j_max with S(i_max, j_max) maximal
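The local variant differs from global alignment in three places: the border entries are 0, the recursion takes the maximum with 0, and the trace-back starts at the stored maximum and stops at the first entry equal to 0. A minimal sketch under these assumptions, with a linear gap penalty; the names `smith_waterman` and `toy_score` and the scoring values (match +2, mismatch -1) are invented for this illustration.

```python
def toy_score(a, b):
    """Hypothetical similarity: +2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


def smith_waterman(x, y, s, g):
    """Local alignment with linear gap penalty g.

    Returns (score, segment_x, segment_y) for one best-scoring pair of segments.
    """
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # border entries stay 0
    best, i_max, j_max = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                0,                                        # start a new local alignment
                S[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                S[i - 1][j] - g,
                S[i][j - 1] - g,
            )
            if S[i][j] > best:                            # remember the maximum
                best, i_max, j_max = S[i][j], i, j
    # Trace-back from the maximal entry until an entry equal to 0 is reached
    row_x, row_y = [], []
    i, j = i_max, j_max
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] - g:
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG", toy_score, 2))
```

The flanking residues that do not match are simply left out of the result, which is what makes the method suitable for sequences that share only a conserved domain.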


Pair-wise local alignment

Algorithm by Smith and Waterman (1981)

Implementation: e.g. BestFit in GCG package

04/21/23 Burkhard Morgenstern, Tunis 2007

Pair-wise local alignment

Complexity: l1 and l2 lengths of sequences: computing time and memory proportional to l1 * l2

Time and space complexity = O(l1 * l2)

Too slow for database searching! Therefore tools like BLAST necessary for database searching

04/21/23 Burkhard Morgenstern, Tunis 2007

The Basic Local Alignment Search Tool (BLAST)

New BLAST version (1997)

Two-hit strategy

Gapped BLAST

Position-Specific Iterative BLAST (PSI-BLAST)

04/21/23 Burkhard Morgenstern, Tunis 2007

The Basic Local Alignment Search Tool (BLAST)

PSI BLAST:

1. search database with standard BLAST

2. take best hits and create multiple alignment

3. calculate profile from multiple alignment

4. search database again with profile as query


profile for sequence family or motif:

table of amino acid/nucleotide frequencies at any position in alignment.

04/21/23 Burkhard Morgenstern, Tunis 2007

The Basic Local Alignment Search Tool (BLAST)

Profile: frequencies of nucleotides at every position (in percent; gaps are not counted).

seq1   A    T    T    G    -    A    T
seq2   C    T    T    G    T    A    G
seq3   A    -    -    G    T    A    T
seq4   A    T    G    G    T    G    T
seq5   A    C    T    G    T    A    C

A     80    0    0    0    0   80    0
T      0   75   75    0  100    0   60
C     20   25    0    0    0    0   20
G      0    0   25  100    0   20   20
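The profile can be computed directly from the aligned rows. The sketch below assumes that gaps are left out of the column totals, which is the convention that reproduces the percentages in the table (for example 75 % T in a column with three T, one C and one gap). The function name `profile` and its parameters are chosen for this example.

```python
from collections import Counter


def profile(alignment, alphabet="ATCG", gap_chars="-"):
    """Return {letter: [percentage per column]} for an alignment.

    alignment: list of equal-length strings; gap characters are ignored
    when computing the column totals.
    """
    table = {letter: [] for letter in alphabet}
    for column in zip(*alignment):
        counts = Counter(c for c in column if c not in gap_chars)
        total = sum(counts.values())
        for letter in alphabet:
            share = 100.0 * counts[letter] / total if total else 0.0
            table[letter].append(share)
    return table


if __name__ == "__main__":
    rows = ["ATTG-AT", "CTTGTAG", "A--GTAT", "ATGGTGT", "ACTGTAC"]
    for letter, values in profile(rows).items():
        print(letter, " ".join(f"{v:5.0f}" for v in values))
```

A profile like this is what PSI-BLAST uses as the query in its second and later search rounds, in place of a single sequence.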

04/21/23 Burkhard Morgenstern, Tunis 2007

Tools for multiple sequence alignment

s1 T Y I M R E A Q Y E S A Q

s2 T C I V M R E A Y E

s3 Y I M Q E V Q Q E R

s4 W R Y I A M R E Q Y E

04/21/23 Burkhard Morgenstern, Tunis 2007

Tools for multiple sequence alignment

s1 - T Y I - M R E A Q Y E S A Q

s2 - T C I V M R E A - Y E - - -

s3 - - Y I - M Q E V Q Q E R - -

s4 W R Y I A M R E - Q Y E - - -

General information in multiple alignment:

Functionally important regions more conserved than non-functional regions

Local sequence conservation indicates functionality!

For phylogeny reconstruction:

Estimate pairwise distances between sequences (distance-based methods for tree reconstruction)

Estimate evolutionary events (parsimony and maximum likelihood methods)

Astronomical number of possible alignments! An alternative alignment of the same sequences:

s1 - T Y I - M R E A Q Y E S A Q

s2 - T C I V M R E A - - - Y E -

s3 Y I - - - M Q E V Q Q E R - -

s4 W R Y I A M R E - Q Y E - - -

Computer has to decide: which one is best?

Questions in development of multiple-alignment programs (as in pairwise alignment):

(1) What is a good alignment? → objective function ('score')

(2) How to find a good alignment? → optimization algorithm

First question far more important!

Traditional objective functions:

Define score of alignments as

Sum of individual similarity scores s(a,b)

Gap penalties

Needleman-Wunsch scoring system (1970)

Can be generalized to multiple alignment (e.g. sum-of-pairs score, tree alignment)

Needleman-Wunsch algorithm can also be generalized to multiple alignment, but:

Very time and memory consuming!

-> Heuristics needed

04/21/23 Burkhard Morgenstern, Tunis 2007

Multiple sequence alignment

First question: how to score multiple alignments?

Possible scoring scheme:

Sum-of-pairs score

Multiple alignment implies pairwise alignments:

1aboA 36 WCEAQt..kngqGWVPSNYITPVN......

1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......

1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp

1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd.....

1vie 28 YAVESeahpgsvQIYPVAALERIN......

Use sum of scores of these pairwise alignments

Needleman-Wunsch scoring scheme can be generalized from pair-wise to multiple alignment
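A sum-of-pairs score can be sketched by scoring every pairwise alignment that the multiple alignment implies and adding up the results. The sketch makes two assumptions that the slides do not spell out: columns in which both rows have a gap are dropped from the implied pairwise alignment, and a linear gap penalty g is used. The function names and the similarity function `plus_minus_one` are hypothetical.

```python
from itertools import combinations


def plus_minus_one(a, b):
    """Hypothetical similarity: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


def pairwise_score(row_a, row_b, s, g, gap_chars="-."):
    """Score of the pairwise alignment implied by two rows of a multiple alignment."""
    score = 0
    for a, b in zip(row_a, row_b):
        a_gap, b_gap = a in gap_chars, b in gap_chars
        if a_gap and b_gap:
            continue            # column vanishes in the pairwise alignment
        if a_gap or b_gap:
            score -= g
        else:
            score += s(a, b)
    return score


def sum_of_pairs(rows, s, g):
    """Add the scores of all implied pairwise alignments."""
    return sum(pairwise_score(a, b, s, g) for a, b in combinations(rows, 2))


if __name__ == "__main__":
    print(sum_of_pairs(["AC-T", "A-GT", "ACGT"], plus_minus_one, g=2))
```

For n sequences there are n(n-1)/2 pairs, so evaluating the score of one given multiple alignment is cheap; the expensive part is searching for the alignment that maximises it.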


Complexity:

For 3 sequences of length l1, l2, l3:

O( l1 * l2 * l3 )

For n sequences ( average length l ):

O( l^n )

Exponential complexity!

Optimal solution not feasible:

-> Heuristics necessary

04/21/23 Burkhard Morgenstern, Tunis 2007

`Progressive´ Alignment

WCEAQTKNGQGWVPSNYITPVN

WWRLNDKEGYVPRNLLGLYP

AVVIQDNSDIKVVPKAKIIRD

YAVESEAHPGSFQPVAALERIN

WLNYNETTGERGDFPGTYVEYIGRKKISP

Guide tree

Idea: align closely related sequences first!

`Progressive´ Alignment

WCEAQTKNGQGWVPSNYITPVN

WW--RLNDKEGYVPRNLLGLYP-

AVVIQDNSDIKVVP--KAKIIRD

YAVESEASFQPVAALERIN

WLNYNEERGDFPGTYVEYIGRKKISP

Profile alignment, “once a gap - always a gap”

04/21/23 Burkhard Morgenstern, Tunis 2007

`Progressive´ Alignment

WCEAQTKNGQGWVPSNYITPVN

WW--RLNDKEGYVPRNLLGLYP-

AVVIQDNSDIKVVP--KAKIIRD

YAVESEASVQ--PVAALERIN------

WLN-YNEERGDFPGTYVEYIGRKKISP

Profile alignment, “once a gap - always a gap”

04/21/23 Burkhard Morgenstern, Tunis 2007

`Progressive´ Alignment

WCEAQTKNGQGWVPSNYITPVN-

WW--RLNDKEGYVPRNLLGLYP-

AVVIQDNSDIKVVP--KAKIIRD

YAVESEASVQ--PVAALERIN------

WLN-YNEERGDFPGTYVEYIGRKKISP

Profile alignment, “once a gap - always a gap”

04/21/23 Burkhard Morgenstern, Tunis 2007

`Progressive´ Alignment

WCEAQTKNGQGWVPSNYITPVN--------

WW--RLNDKEGYVPRNLLGLYP--------

AVVIQDNSDIKVVP--KAKIIRD-------

YAVESEA---SVQ--PVAALERIN------

WLN-YNE---ERGDFPGTYVEYIGRKKISP

Profile alignment, “once a gap - always a gap”

04/21/23 Burkhard Morgenstern, Tunis 2007

`Progressive´ Alignment

“Greedy” algorithm:

Consider partial solutions of bigger problem

Search best partial solution, fix solution

Search second-best partial solution that is consistent with first solution, fix solution

Search third-best partial solution … etc.

E.g.: knapsack problem
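The greedy idea can be illustrated with the knapsack problem mentioned as an example. The sketch below is not taken from the slides: it assumes items given as (name, weight, value) tuples and a greedy rule that always takes the item with the best value per weight that still fits. Item names and numbers are invented.

```python
def greedy_knapsack(items, capacity):
    """Pick items greedily by value-per-weight ratio.

    items: list of (name, weight, value) tuples with positive weights.
    Returns the names of the chosen items in the order they were fixed.
    """
    chosen = []
    remaining = capacity
    # Best partial solution first, then the next best that is still consistent.
    for name, weight, value in sorted(items, key=lambda it: it[2] / it[1], reverse=True):
        if weight <= remaining:
            chosen.append(name)
            remaining -= weight
    return chosen


if __name__ == "__main__":
    items = [("a", 10, 60), ("b", 20, 100), ("c", 30, 120)]
    print(greedy_knapsack(items, capacity=50))
```

In this made-up instance the greedy choice takes "a" and "b" (total value 160), although "b" and "c" together would fit and be worth 220. Progressive alignment has the same weakness: an early decision, such as a gap introduced in the first pairwise alignment, is never revised.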


Most important software program:

CLUSTAL W: J. Thompson, T. Gibson, D. Higgins (1994), CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment … Nucl. Acids Res. 22, 4673-4680

(~ 18,000 citations in the literature)

04/21/23 Burkhard Morgenstern, Tunis 2007

Tools for multiple sequence alignment

Problems with traditional approach:

Results depend on gap penalty

Heuristic guide tree determines alignment;

alignment used for phylogeny reconstruction

Algorithm produces global alignments.


But:

Many sequence families share only local similarity

E.g. sequences share one conserved motif

04/21/23 Burkhard Morgenstern, Tunis 2007

Local sequence alignment

Find common motif in sequences; ignore the rest

EYENS

ERYENS

ERYAS

04/21/23 Burkhard Morgenstern, Tunis 2007

Local sequence alignment

Find common motif in sequences; ignore the rest

E-YENS

ERYENS

ERYA-S

04/21/23 Burkhard Morgenstern, Tunis 2007

Local sequence alignment

Find common motif in sequences; ignore the rest – Local alignment

E-YENS

ERYENS

ERYA-S

04/21/23 Burkhard Morgenstern, Tunis 2007

Local sequence alignment

Important methods for local multiple alignment:

• PIMA

• MEME/MAST

Idea: expectation maximization.

04/21/23 Burkhard Morgenstern, Tunis 2007

Local sequence alignment

Traditional alignment approaches:

Either global or local methods!

04/21/23 Burkhard Morgenstern, Tunis 2007

New question: sequence families with multiple local similarities

Neither local nor global methods applicable

04/21/23 Burkhard Morgenstern, Tunis 2007

New question: sequence families with multiple local similarities

Alignment possible if order conserved

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

Morgenstern, Dress, Werner (1996), PNAS 93, 12098-12103

Combination of global and local methods

Assemble multiple alignment from gap-free local pair-wise alignments (“fragments”)

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaa--gagtatcacccctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaa--gagtatcacc----------cctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atc------taatagttaaactcccccgtgc-ttag

cagtgcgtgtattactaac----------gg-ttcaatcgcg

caaa--gagtatcacc----------cctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atc------taatagttaaactcccccgtgc-ttag

cagtgcgtgtattactaac----------gg-ttcaatcgcg

caaa--gagtatcacc----------cctgaattgaataa

Consistency!

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atc------TAATAGTTAaactccccCGTGC-TTag

cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg

caaa--GAGTATCAcc----------CCTGaaTTGAATaa

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

Score of an alignment:

Define score of fragment f:

l(f) = length of f

s(f) = sum of matches (similarity values)

P(f) = probability to find a fragment with length l(f) and at least s(f) matches in random sequences that have the same length as the input sequences.

Score w(f) = -ln P(f)
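
The weight definition above can be illustrated with a small sketch. It is not the DIALIGN implementation: it assumes DNA sequences with a uniform match probability of 1/4, uses a binomial tail for a fixed pair of segments, and multiplies by the number of start-position pairs as a rough correction for sequence length. The function name `fragment_weight` and its parameters are hypothetical.

```python
import math


def fragment_weight(length, matches, len1, len2, p_match=0.25):
    """Approximate w(f) = -ln P(f) for a gap-free DNA fragment.

    Assumptions (not taken from the slides): independent positions,
    match probability p_match, and a simple len1 * len2 correction
    for the number of places a fragment could start.
    """
    # Probability that one fixed segment pair of this length shows
    # at least `matches` matches (binomial tail).
    tail = sum(
        math.comb(length, k) * p_match**k * (1 - p_match) ** (length - k)
        for k in range(matches, length + 1)
    )
    # Rough probability of seeing such a fragment anywhere, capped at 1.
    prob = min(1.0, len1 * len2 * tail)
    return -math.log(prob)
```

Under these assumptions, a fragment with many matches is unlikely to occur by chance and gets a high weight, while a short or poorly matching fragment gets a weight close to zero.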

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

Score of an alignment:

Define score of fragment f:

Define score of alignment as

sum of scores of involved fragments

No gap penalty!

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

Score of an alignment:

Goal in fragment-based alignment approach: find

Consistent collection of fragments with maximum sum of weight scores

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atctaatagttaaaccccctcgtgcttag

agatccaaaccagtgcgtgtattactaacggttcaatcgcgcacatccgc

Pair-wise alignment:

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atctaatagttaaaccccctcgtgcttag

agatccaaaccagtgcgtgtattactaacggttcaatcgcgcacatccgc

Pair-wise alignment:

recursive algorithm finds optimal chain of fragments.

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

------atctaatagttaaaccccctcgtgcttag-------agatccaaaccagtgcgtgtattactaac----------ggttcaatcgcgcacatccgc--

Pair-wise alignment:

recursive algorithm finds optimal chain of fragments.

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

------atctaatagttaaaccccctcgtgcttag-------agatccaaaccagtgcgtgtattactaac----------ggttcaatcgcgcacatccgc--

Optimal pairwise alignment: chain of fragments with maximum sum of weights found by dynamic programming:

Standard fragment-chaining algorithm

Space-efficient algorithm
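
A minimal sketch of fragment chaining by dynamic programming, assuming each fragment is given as a hypothetical tuple `(start1, start2, length, weight)`. It uses the simple quadratic-time recursion, not the space-efficient variant mentioned above, and the name `best_chain` is an illustration only.

```python
def best_chain(fragments):
    """Return (score, chain) for the chain with maximum sum of weights.

    Each fragment is (start1, start2, length, weight): a gap-free pair
    of segments starting at start1 in sequence 1 and start2 in sequence 2.
    """
    # Sort by end position in sequence 1 so that every possible
    # predecessor of a fragment appears before it in the list.
    frags = sorted(fragments, key=lambda f: f[0] + f[2])
    best = []  # best[i]: highest score of a chain ending with frags[i]
    prev = []  # prev[i]: index of the predecessor in that chain, or None
    for i, (s1, s2, length, weight) in enumerate(frags):
        score, back = weight, None
        for j in range(i):
            end1 = frags[j][0] + frags[j][2]
            end2 = frags[j][1] + frags[j][2]
            # frags[j] may precede frags[i] only if it ends before
            # frags[i] starts in both sequences.
            if end1 <= s1 and end2 <= s2 and best[j] + weight > score:
                score, back = best[j] + weight, j
        best.append(score)
        prev.append(back)
    if not frags:
        return 0.0, []
    # Trace back from the best-scoring chain end.
    i = max(range(len(frags)), key=best.__getitem__)
    total, chain = best[i], []
    while i is not None:
        chain.append(frags[i])
        i = prev[i]
    return total, chain[::-1]
```

Because no gap penalty is used, the score of a chain is simply the sum of the fragment weights; regions between chained fragments stay un-aligned.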

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

Multiple alignment:

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

Multiple alignment:

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

(1) Calculate all optimal pair-wise alignments

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

Fragments from optimal pair-wise alignments might be inconsistent

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaa--gagtatcacccctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaa--gagtatcacccctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

Fragments from optimal pair-wise alignments might be inconsistent

1. Sort fragments according to scores

2. Include them one-by-one into growing multiple alignment – as long as they are consistent

(greedy algorithm, comparable to knapsack problem)
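
The two steps above can be sketched as follows. This is a deliberately naive illustration, not the DIALIGN code: consistency is re-checked from scratch for every candidate fragment instead of maintaining bounds incrementally. Fragments are assumed to be hypothetical tuples `(seq_a, start_a, seq_b, start_b, length, weight)`; a set of fragments is treated as consistent if merging the aligned sites into columns leaves the left-to-right order of the columns free of cycles.

```python
from collections import defaultdict


def is_consistent(fragments):
    """True if all fragments fit into one alignment without conflicts."""
    parent = {}

    def find(site):
        parent.setdefault(site, site)
        while parent[site] != site:
            parent[site] = parent[parent[site]]  # path compression
            site = parent[site]
        return site

    # Merge sites that the fragments align into the same column.
    for sa, pa, sb, pb, length, _ in fragments:
        for k in range(length):
            parent[find((sa, pa + k))] = find((sb, pb + k))

    # Columns must respect sequence order: add an edge between the
    # columns of consecutive aligned positions in each sequence.
    positions = defaultdict(list)
    for seq, pos in parent:
        positions[seq].append(pos)
    columns = {find(site) for site in list(parent)}
    edges, indegree = defaultdict(set), defaultdict(int)
    for seq, plist in positions.items():
        plist.sort()
        for p, q in zip(plist, plist[1:]):
            u, v = find((seq, p)), find((seq, q))
            if u == v:  # two residues of one sequence in one column
                return False
            if v not in edges[u]:
                edges[u].add(v)
                indegree[v] += 1

    # Topological sort: consistent if and only if there is no cycle.
    queue = [c for c in columns if indegree[c] == 0]
    visited = 0
    while queue:
        u = queue.pop()
        visited += 1
        for v in edges[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return visited == len(columns)


def greedy_alignment(fragments):
    """Accept fragments by decreasing weight while they stay consistent."""
    accepted = []
    for frag in sorted(fragments, key=lambda f: f[5], reverse=True):
        if is_consistent(accepted + [frag]):
            accepted.append(frag)
    return accepted
```

Re-checking everything for each fragment is far too slow for real data; it only serves to make the consistency condition explicit.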

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Consistency problem

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Upper and lower bounds for alignable positions

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaa--gagtatcacccctgaattgaataa

Upper and lower bounds for alignable positions

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atc------taatagt taaactcccccgtgcttag

cagtgcgtgtattact aacggttcaatcgcg

caaa--gagtatcacccctgaattgaataa

Upper and lower bounds for alignable positions

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atc------taata-----gttaaactcccccgtgcttag

cagtgcgtgtatta-----ctaacggttcaatcgcg

caaa--gagtatcacccctgaattgaataa

Upper and lower bounds for alignable positions

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Upper and lower bounds for alignable positions

site x = [i,p] (sequence i, position p)

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Upper and lower bounds for alignable positions

Calculate lower bound bl(x,i) and upper bound bu(x,i) for each site x and each sequence i

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Upper and lower bounds for alignable positions

bl(x,i) and bu(x,i) updated for each new fragment in alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

Consistency bounds have to be updated for each new fragment that is included into the growing alignment

Efficient algorithm

(Abdeddaim and Morgenstern, 2002)

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

Advantages of segment-based approach:

Program can produce global and local alignments!

Can align sequence families that cannot be aligned with standard methods

04/21/23 Burkhard Morgenstern, Tunis 2007

The DIALIGN approach

DIALIGN is available

Online at BiBiServ (Bielefeld Bioinformatics Server)

Downloadable UNIX/LINUX executables at BiBiServ

Source code (email to BM)

04/21/23 Burkhard Morgenstern, Tunis 2007

Program input

Program usage:

> dialign2-2 [options] <input_file>

<input_file> = multi-sequence file in FASTA-format

04/21/23 Burkhard Morgenstern, Tunis 2007

Program output

DIALIGN 2.2.1
*************

Program code written by Burkhard Morgenstern and Said Abdeddaim

e-mail contact: bmorgen@gwdg.de

Published research assisted by DIALIGN 2 should cite:

Burkhard Morgenstern (1999). DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, 211 - 218.

For more information, please visit the DIALIGN home page at

http://bibiserv.techfak.uni-bielefeld.de/dialign/

program call: ./dialign2-2 -nt -anc s

Aligned sequences:   length:
==================   =======
1) dog_il4           300
2) bla               200
3) blu               200

Average seq. length: 233.3

Please note that only upper-case letters are considered to be aligned.

04/21/23 Burkhard Morgenstern, Tunis 2007

Program output

Alignment (DIALIGN format):
===========================

dog_il4     1   cagg------ ----GTTTGA atctgataca ttgc------ ----------
bla         1   ctga------ ---------- ---------- --------GC CAAGTGGGAA
blu         1   ttttgatatg agaaGTGTGA aacaagctat cctatattGC TAAGTGGCAG
                0000000000 0000000000 0000000000 0000000011 1111111111

dog_il4    25   ---------- --ATGGCACT GGGGTGAATG AGGCAGGCAG CAGAATGATC
bla        17   ggtgtgaata catgggtttc cagtaccttc tgaggtccag agtacc----
blu        51   ccctggcttt ctATGTGCAC AGAATGGGAG GAAAGTGCCT GCTAGTGAGC
                0000000000 0000000000 0000000000 0000000000 0000000000

dog_il4    63   GTACTGCAGC CCTGAGCTTC CACTGGCCCA TGTTGGTATC CTTGTATTTT
bla        63   ---------- ---------- ---TTTCCCA TGTGCTCCAT GGTGGAATGG
blu       101   CAGGGACTCA GAGAGAATGG AGTATAGGGG TCAGGGCat- ----------
                0000000000 0000000000 0009999999 9999999888 8888888888

dog_il4   113   TCCGCCCCTT CCCAGCACca gcattatcct ---GGGATTG GAGAAGGGGG
bla        90   ACCACTCCTT CTCAGCACaa caaagcccaa gaaGGTGTTG CGTTCTAGAC
blu       140   ---------- ---------- ---------- ---GGGGTGG CCTTAGGCTC
                8888888888 8888888800 0000000000 0007777777 7777777777

04/21/23 Burkhard Morgenstern, Tunis 2007

T-COFFEE

C. Notredame, D. Higgins, J. Heringa (2000), T-Coffee: A novel algorithm for multiple sequence alignment, J. Mol. Biol.

04/21/23 Burkhard Morgenstern, Tunis 2007

T-COFFEE

Problem with “progressive” approaches:

Strictly global alignments

Use only pair-wise comparison

04/21/23 Burkhard Morgenstern, Tunis 2007

T-COFFEE

Idea: Start with local and global pair-wise alignments (“primary library” of alignments)

Construct “secondary library” of residues that are indirectly aligned by primary library.

Re-score residue pairs

Construct final alignment with “progressive” method
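
The re-scoring step can be sketched as below. This is an illustration under stated assumptions, not the T-Coffee source code: the primary library is modelled as a hypothetical dictionary mapping pairs of sites `(sequence, position)` to weights, and each pair of residues x, y receives its direct weight plus, for every intermediate residue z aligned to both, the smaller of the two weights. The function name `extend_library` is made up for this example.

```python
from collections import defaultdict


def extend_library(primary):
    """Re-score residue pairs using indirect alignments via a third residue.

    primary: dict mapping (site_x, site_y) to a weight, where each site
    is a (sequence_name, position) tuple. Returns a dict keyed by sorted
    site pairs; pairs aligned only indirectly are added as new entries.
    """
    neighbours = defaultdict(dict)
    extended = defaultdict(float)
    for (x, y), weight in primary.items():
        neighbours[x][y] = weight
        neighbours[y][x] = weight
        extended[tuple(sorted((x, y)))] += weight

    # Every residue z that is aligned to both x and y supports the
    # pair (x, y) with the weaker of its two links.
    for z, linked in neighbours.items():
        items = sorted(linked.items())
        for i, (x, w_xz) in enumerate(items):
            for y, w_zy in items[i + 1:]:
                if x[0] != y[0]:  # only pairs from different sequences
                    extended[tuple(sorted((x, y)))] += min(w_xz, w_zy)
    return dict(extended)
```

The progressive alignment is then computed with these extended weights in place of a substitution matrix, so that each pairwise step already reflects information from the other sequences.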

04/21/23 Burkhard Morgenstern, Tunis 2007

T-COFFEE

Advantage:

Combination of local and global approaches

Less sensitive to mis-alignments in the progressive procedure

04/21/23 Burkhard Morgenstern, Tunis 2007

T-COFFEE

T-COFFEE and DIALIGN:

Less sensitive to spurious pairwise similarities

Can handle local homologies better than CLUSTAL

04/21/23 Burkhard Morgenstern, Tunis 2007

Most multiple-alignment approaches are automated, i.e. based on algorithmic rules. Two components:

Objective function: assess alignment quality

Optimization algorithm: find optimal or near-optimal alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

Fully automated alignment programs are necessary

if no expert knowledge is available

if large amounts of data are to be analyzed

But:

Often no biologically reasonable results

Often additional information about homologies etc. available

04/21/23 Burkhard Morgenstern, Tunis 2007

Idea for improved alignment

Use expert knowledge to influence alignment procedure

DIALIGN with user-defined anchor points

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of large genomic sequences

Alignment of large genomic sequences to identify functional elements (phylogenetic footprinting)

Göttgens et al., 2000, 2001, 2002, … Pollard et al., 2004

DIALIGN, MGA, PipMaker, LAGAN, AVID, Mummer, WABA, …

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of large genomic sequences

Gene-regulatory sites identified by multiple sequence alignment (phylogenetic footprinting)

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of large genomic sequences

04/21/23 Burkhard Morgenstern, Tunis 2007

DIALIGN alignment of human and murine genomic sequences

04/21/23 Burkhard Morgenstern, Tunis 2007

DIALIGN alignment of tomato and Arabidopsis thaliana genomic sequences

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of large genomic sequences

DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)

Alignment of Hox gene cluster:

DIALIGN able to identify small regulatory elements, but

Entire genes totally mis-aligned

Reason for mis-alignment: duplications!

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of large genomic sequences

The Hox gene cluster:

4 Hox gene clusters in pufferfish. 14 genes, different genes in different clusters!

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of large genomic sequences

The Hox gene cluster:

Complete mis-alignment of entire genes!

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Conserved motifs; no similarity outside motifs

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Duplication in two sequences

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Duplication in two sequences

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Duplication in two sequences

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Mis-alignment would have lower score!

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Duplication in one sequence

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Duplication in one sequence

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Duplication in one sequence

Possible mis-alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Duplication in one sequence

S3

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Duplication in one sequence

S3

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Duplication in one sequence

S3

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Duplication in one sequence

S3

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Consistency problem

S3

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

More plausible alignment – and higher score:

S3

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Consistency problem

S3

04/21/23 Burkhard Morgenstern, Tunis 2007

Alignment of sequence duplications

S1

S2

Alternative alignment; probably biologically wrong; lower numerical score!

S3

04/21/23 Burkhard Morgenstern, Tunis 2007

Anchored sequence alignment

Biologically meaningful alignment not possible by automated approaches.

Idea: use expert knowledge to guide alignment procedure

User defines a set of anchor points that are to be “respected” by the alignment procedure

04/21/23 Burkhard Morgenstern, Tunis 2007

Anchored sequence alignment

NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

04/21/23 Burkhard Morgenstern, Tunis 2007

Anchored sequence alignment

NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

Use known homology as anchor point

04/21/23 Burkhard Morgenstern, Tunis 2007

Anchored sequence alignment

NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

Use known homology as anchor point

Anchor point = anchored fragment (gap-free pair of segments)

Remainder of sequences aligned automatically

04/21/23 Burkhard Morgenstern, Tunis 2007

Anchored sequence alignment

NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

Alignment of anchored positions a and b not enforced – a and b may be un-aligned –, but:

a is only residue that can be aligned to b

Residues left of a aligned with residues left of b
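
For two sequences, the constraint imposed by a single anchored pair of positions can be written as a small test. This is a sketch of the rule stated above; the function name `compatible_with_anchor` and its arguments are hypothetical, with `a` a position in the first sequence anchored to position `b` in the second.

```python
def compatible_with_anchor(p, q, a, b):
    """Can position p (sequence 1) be aligned with q (sequence 2)
    when position a is anchored to position b?

    Allowed: the anchored pair itself, or pairs that lie on the same
    side of the anchor in both sequences.
    """
    return (p < a and q < b) or (p == a and q == b) or (p > a and q > b)
```

An anchored fragment of length n corresponds to n such anchored pairs, one for each pair of positions inside the fragment.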

04/21/23 Burkhard Morgenstern, Tunis 2007

Anchored sequence alignment

-------NLF VALYDFVASG DNTLSITKGE klrvlgynhn

iihredkGVI YALWDYEPQN DDELPMKEGD cmt-------

Anchored alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

Anchored sequence alignment

NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS

Anchor points in multiple alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

Anchored sequence alignment

NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQND DELPMKEGDCMT

GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS

Anchor points in multiple alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

Anchored sequence alignment

-------NLF V-ALYDFVAS GD-------- NTLSITKGEk lrvLGYNhn

iihredkGVI Y-ALWDYEPQ ND-------- DELPMKEGDC MT-------

-------GYQ YrALYDYKKE REedidlhlg DILTVNKGSL VA-LGFS--

Anchored multiple alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

Goal:

Find optimal alignment (= consistent set of fragments) under constraints given by user-specified anchor points!

04/21/23 Burkhard Morgenstern, Tunis 2007

Additional input file with anchor points:

1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5

Algorithmic questions

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS

04/21/23 Burkhard Morgenstern, Tunis 2007

Additional input file with anchor points:

1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5

Sequences start positions length score
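
A short sketch of reading such an anchor file, assuming exactly the six whitespace-separated columns labelled above (two sequence numbers, two start positions, length, score). The names `Anchor` and `parse_anchors` are hypothetical and not part of DIALIGN.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    seq1: int      # number of the first sequence
    seq2: int      # number of the second sequence
    start1: int    # start position in the first sequence
    start2: int    # start position in the second sequence
    length: int    # length of the anchored (gap-free) fragment
    score: float   # user-defined priority of the anchor


def parse_anchors(text):
    """Parse anchor lines of the form 'seq1 seq2 start1 start2 length score'."""
    anchors = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        if len(fields) != 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        *integers, score = fields
        anchors.append(Anchor(*map(int, integers), float(score)))
    return anchors
```

Sorting the parsed anchors by decreasing `score` gives the order in which a greedy procedure would consider them.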

Algorithmic questions

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

Requirements:

Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Inconsistent anchor points!

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

atctaat---agttaaactcccccgtgcttag

cagtgcgtgtattac-taacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Inconsistent anchor points!

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

Requirements:

Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points

Find alignment under constraints given by anchor points!

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

Use data structures from multiple alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Greedy procedure for multiple alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Question: which positions are still alignable ?

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

atctaatagttaaactcccccgtgcttag Si

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa x

For each position x and each sequence Si there exist an upper bound ub(x,i) and a lower bound lb(x,i) for residues y in Si that are alignable with x

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

atctaatagttaaactcccccgtgcttag Si

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa x

ub(x,i) and lb(x,i) updated during greedy procedure

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

atctaatagttaaactcccccgtgcttag Si

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa x

Initial values of lb(x,i), ub(x,i)

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

Anchor points treated like fragments in greedy algorithm:

Sorted according to user-defined scores

Accepted if consistent with previously accepted anchors

ub(x,i) and lb(x,i) updated during greedy procedure

Resulting values of ub(x,i) and lb(x,i) used as initial values for alignment procedure

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

atctaatagttaaactcccccgtgcttag Si

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa x

Initial values of lb(x,i), ub(x,i) calculated using anchor points

04/21/23 Burkhard Morgenstern, Tunis 2007

Algorithmic questions

Anchor points can be ranked by priority, e.g.

anchor points from verified homologies -- higher priority

automatically created anchor points (using CHAOS, BLAST, … ) -- lower priority

04/21/23 Burkhard Morgenstern, Tunis 2007

Application: Hox gene cluster

04/21/23 Burkhard Morgenstern, Tunis 2007

Application: Hox gene cluster

Use gene boundaries as anchor points

04/21/23 Burkhard Morgenstern, Tunis 2007

Application: Hox gene cluster

Use gene boundaries as anchor points

+ CHAOS / BLAST hits

04/21/23 Burkhard Morgenstern, Tunis 2007

Application: Hox gene cluster

                  no anchoring   anchoring
Aligned columns
  2 seq           2958           3674
  3 seq           668            1091
  4 seq           244            195
Score             1166           1007
CPU time          4:22           0:19

04/21/23 Burkhard Morgenstern, Tunis 2007

Application: Hox gene cluster

Example:

Teleost Hox gene cluster:

Score of anchored alignment 15 % higher than score of non-anchored alignment !

Conclusion: Greedy optimization algorithm does a bad job!

04/21/23 Burkhard Morgenstern, Tunis 2007

Application: Improvement of Alignment programs

Two possible reasons for mis-alignments:

Wrong objective function: Biologically correct alignment gets bad numerical score

Bad optimization algorithms: Biologically correct alignment gets best numerical score, but algorithm fails to find this alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

Application: Improvement of Alignment programs

Two possible reasons for mis-alignments:

Anchored alignments can help to decide

04/21/23 Burkhard Morgenstern, Tunis 2007

Application: RNA alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

Application: RNA alignment

aa----CCCC AGC---GUAa gucgcuaucc a

cacucuCCCA AGC---GGAG Aac------- -

ccg----CCA AaagauGGCG Acuuga---- -

non-anchored alignment

04/21/23 Burkhard Morgenstern, Tunis 2007

Application: RNA alignment

aa----CCCC AGC---GUAa gucgcuaucc a

cacucuCCCA AGC---GGAG Aac------- -

ccg----CCA AaagauGGCG Acuuga---- -

structural motif mis-aligned

04/21/23 Burkhard Morgenstern, Tunis 2007

Application: RNA alignment

aaCCCCAGCG UAAGUCGCUA UCca--

--CACUCUCC CAAGCGGAGA AC----

----CCGCCA AAAGAUGGCG ACuuga

3 conserved nucleotides as anchor points

WWW interface at GOBICS (Göttingen Bioinformatics Compute Server)

Gene predictions for eukaryotes

Goal: find location and structure of protein-coding genes in eukaryotic genome sequences.

Gene predictions for eukaryotes

attgccagtacgtagctagctacacgtatgctattacggatctgtagcttagcgtatctgtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcttagtcgtgtagtcttgatctacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgtacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatggtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctag

Gene predictions for eukaryotes

Three different approaches to computational gene-finding:

Intrinsic: use statistical information about known genes (Hidden Markov Models)

Extrinsic: compare genomic sequence with known proteins / genes

Cross-species sequence comparison: search for similarities among genomes

Hidden-Markov-Models (HMM) for gene prediction

Generative probabilistic model for a sequence of observations ("symbols").

Finite set of states

States can emit symbols

Transitions between states possible

Sequence generated by path between states

Hidden-Markov-Models (HMM) for gene prediction

Example: The occasionally dishonest casino.

3 5 6 6 6 4 6 5 1 6 5 1 2

F F U U U U U F F F F F F

Possible states:

fair (F); unfair (U); begin (B); end (E)

Hidden-Markov-Models (HMM) for gene prediction

Assumptions:

Emission probabilities known; depend only on current state.

Transition probabilities known, depend only on current state

Hidden-Markov-Models (HMM) for gene prediction

[Figure: state diagram with the states B (begin), F (fair), U (unfair) and E (end)]

Hidden-Markov-Models (HMM) for gene prediction

3 5 6 6 6 4 6 5 1 6 5 1 2 s

B F F U U U U U F F F F F F E φ

For sequence s and parse φ:

P(φ) probability of φ

P(φ,s) joint probability of φ and s = P(φ) * P(s|φ)

P(φ|s) a-posteriori probability of φ
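The joint probability of one parse and the observed rolls can be computed directly as a product of transition and emission probabilities. A minimal sketch for the casino example; all numerical parameters are assumptions (the slides give none), and the transition into the end state E is omitted for brevity.

```python
import math

# Hypothetical parameters, not given on the slides.
START = {"F": 0.5, "U": 0.5}                      # B -> first state
TRANS = {"F": {"F": 0.95, "U": 0.05},
         "U": {"F": 0.10, "U": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},     # fair die
        "U": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}  # loaded die


def log_joint(path: str, rolls: list) -> float:
    """log P(phi, s) = log P(phi) + log P(s | phi) for one parse phi."""
    logp = math.log(START[path[0]]) + math.log(EMIT[path[0]][rolls[0]])
    for prev, cur, roll in zip(path, path[1:], rolls[1:]):
        logp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][roll])
    return logp


rolls = [3, 5, 6, 6, 6, 4, 6, 5, 1, 6, 5, 1, 2]   # s from the slide
parse = "FFUUUUUFFFFFF"                            # phi from the slide, without B and E
all_fair = "F" * len(rolls)
```

Comparing `log_joint(parse, rolls)` with `log_joint(all_fair, rolls)` shows which of the two parses explains the rolls better under the assumed parameters.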

Hidden-Markov-Models (HMM) for gene prediction

3 5 6 6 6 4 6 5 1 6 5 1 2

B F F U U U U U F F F F F F E

Goal: find path φ with maximum a-posteriori probability P(φ|s)

Idea: P(φ|s) = P(φ,s) / P(s), and P(s) does not depend on φ, so it suffices to find the path that maximizes the joint probability P(φ,s) by dynamic programming (Viterbi algorithm)
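A compact sketch of the Viterbi recursion for the two-state casino model. The parameters are assumptions chosen for illustration, log probabilities are used to avoid underflow, and the end state E is left out.

```python
import math

STATES = ("F", "U")
# Hypothetical parameters, not given on the slides.
START = {"F": 0.5, "U": 0.5}
TRANS = {"F": {"F": 0.95, "U": 0.05},
         "U": {"F": 0.10, "U": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "U": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}


def viterbi(rolls):
    """Return the state path phi maximizing P(phi, s) and its log probability."""
    # v[k] = best log probability of any path that ends in state k
    v = {k: math.log(START[k]) + math.log(EMIT[k][rolls[0]]) for k in STATES}
    back = []
    for roll in rolls[1:]:
        ptr, nxt = {}, {}
        for k in STATES:
            best = max(STATES, key=lambda j: v[j] + math.log(TRANS[j][k]))
            ptr[k] = best
            nxt[k] = v[best] + math.log(TRANS[best][k]) + math.log(EMIT[k][roll])
        back.append(ptr)
        v = nxt
    # Trace back from the best final state.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[last]


best_path, best_logp = viterbi([3, 5, 6, 6, 6, 4, 6, 5, 1, 6, 5, 1, 2])
```

The running time is linear in the sequence length and quadratic in the number of states, instead of exponential for enumerating all paths.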

Hidden-Markov-Models (HMM) for gene prediction

Application to gene prediction:

A T A A T G C C T A G T C   s (DNA)
Z Z Z E E E E E E I I I I   φ (parse)

Introns, exons etc. modeled as states in GHMM ("generalized HMM")

Given sequence s, find parse that maximizes P(φ|s)

(S. Karlin and C. Burge, 1997)

AUGUSTUS

Basic model for GHMM-based intrinsic gene finding comparable to GenScan (M. Stanke)

AUGUSTUS

Features of AUGUSTUS:

Intron length model

Initial pattern for exons

Similarity-based weighting for splice sites

Interpolated HMM

Internal 3’ content model

Hidden-Markov-Models (HMM) for gene prediction

A T A A T G C C T A G T C   s (DNA)
Z Z Z E E E E I I I I       φ (parse)

Explicit intron length model computationally expensive.

AUGUSTUS

Intron length model:

• Explicit length distribution for short introns

• Geometric tail for long introns

[Figure: state diagram with the states Exon, Intron (fixed), Intron (expl.) and Intron (geo.)]
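A state with a self-transition produces geometrically distributed lengths at no extra cost, which is why only short introns get an explicit distribution. A minimal sketch of such a mixed length model; the function name, the parameterization and all numbers are assumptions for illustration, not the values used in AUGUSTUS.

```python
def intron_length_prob(length, explicit, tail_mass, stay):
    """P(intron length = length) under a mixed model.

    explicit  : dict length -> probability for short introns (sums to 1 - tail_mass)
    tail_mass : total probability of introns longer than max(explicit)
    stay      : per-base continuation probability in the geometric tail
    """
    d = max(explicit)
    if length <= d:
        return explicit.get(length, 0.0)
    # Geometric tail: leave the intron state with probability (1 - stay) per step.
    return tail_mass * (1.0 - stay) * stay ** (length - d - 1)


explicit = {3: 0.2, 4: 0.3, 5: 0.2}   # toy numbers; real introns are far longer
total = sum(intron_length_prob(n, explicit, 0.3, 0.9) for n in range(1, 2000))
```

The sum `total` is numerically 1, so the two parts together form a proper length distribution.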

AUGUSTUS

Extension of AUGUSTUS to include extrinsic information:

Protein sequences

EST sequences

Syntenic genomic sequences

User-defined constraints

Gene prediction by phylogenetic footprinting

Comparison of genomic sequences

(human and mouse)

catcatatcttatcttacgttaactcccccgt

cagtgcgtgatagcccatatccgg

Standard score: consider the length and the number of matches of a fragment and compute the probability of its random occurrence
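A simplified sketch of such a score: the probability that a random gap-free fragment of the same length contains at least as many matches, turned into a weight by a negative logarithm. It assumes independent positions with match probability 1/4 and is not DIALIGN's exact formula; the function names are illustrative.

```python
import math


def random_match_prob(length: int, matches: int, p_match: float = 0.25) -> float:
    """P(at least `matches` matches in a random fragment of `length` columns)."""
    return sum(math.comb(length, k) * p_match ** k * (1 - p_match) ** (length - k)
               for k in range(matches, length + 1))


def fragment_weight(length: int, matches: int) -> float:
    """Higher weight = less likely to occur by chance."""
    return -math.log(random_match_prob(length, matches))


# Two equally long segments taken from the example sequences on the slide.
seg1, seg2 = "ttatcttacgtt", "atagcccatatc"
matches = sum(a == b for a, b in zip(seg1, seg2))
weight = fragment_weight(len(seg1), matches)
```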

Gene prediction by phylogenetic footprinting

Translation option:

           L   S   Y   V
catcatatc tta tct tac gtt aactcccccgt
cagtgcgtg ata gcc cat atc cgg
           I   A   H   I

DNA segments translated to peptide segments; fragment score based on peptide similarity:

Calculate probability of finding a fragment of the same length with (at least) the same sum of BLOSUM values
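The translation step and the sum of BLOSUM values can be sketched as follows. The codon table is the standard genetic code; the substitution scores are only the four BLOSUM62 entries needed for this example, whereas a real program would load the full matrix. The names `translate` and `blosum_sum` are illustrative.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the usual TCAG ordering of codons.
CODON = {a + b + c: AMINO[16 * i + 4 * j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}

# Subset of BLOSUM62, sufficient for the example below only.
BLOSUM62_SUBSET = {frozenset("LI"): 2, frozenset("SA"): 1,
                   frozenset("YH"): 2, frozenset("VI"): 3}


def translate(dna: str) -> str:
    """Translate a DNA segment codon by codon, starting at its first base."""
    dna = dna.upper()
    return "".join(CODON[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def blosum_sum(pep1: str, pep2: str) -> int:
    """Sum of substitution scores over the aligned peptide positions."""
    return sum(BLOSUM62_SUBSET[frozenset((a, b))] for a, b in zip(pep1, pep2))


pep1 = translate("ttatcttacgtt")   # upper segment from the slide
pep2 = translate("atagcccatatc")   # lower segment from the slide
score = blosum_sum(pep1, pep2)
```

The fragment score is then derived from the probability that a random peptide fragment of the same length reaches at least this sum.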

Gene prediction by phylogenetic footprinting

P-fragment (in both orientations)

           L   S   Y   V
catcatatc tta tct tac gtt aactcccccgt
cagtgcgtg ata gcc cat atc cgg
           I   A   H   I

N-fragment

catcatatc ttatcttacgtt aactcccccgtgct
           || |  |  |
cagtgcgtg atagcccatatc cg

For each fragment f, three probability values are calculated; the score of f is based on the smallest P value.

P-fragments are associated with a strand and a reading frame!

Gene prediction by phylogenetic footprinting

AGenDA: Alignment-based Gene Detection Algorithm

Gene prediction by phylogenetic footprinting

Fragments in DIALIGN alignment

Gene prediction by phylogenetic footprinting

Build cluster of fragments

Gene prediction by phylogenetic footprinting

Identify conserved splice sites

Gene prediction by phylogenetic footprinting

• Candidate exons bounded by conserved splice sites

• Find optimal chain of candidate exons
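Finding an optimal chain can be done by dynamic programming over the candidates sorted by end position. The sketch below is a simplification: it assumes each candidate exon carries a single score and only requires that chained exons do not overlap, ignoring reading-frame compatibility. The data layout and the name `best_chain` are illustrative, not AGenDA's implementation.

```python
from typing import List, Tuple

Exon = Tuple[int, int, float]  # (start, end, score); end is inclusive


def best_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the maximum total score and the chain of non-overlapping exons."""
    if not exons:
        return 0.0, []
    exons = sorted(exons, key=lambda e: e[1])
    best: List[float] = []   # best[i] = max score of a chain ending with exons[i]
    prev: List[int] = []     # predecessor index in that chain, -1 if none
    for i, (start, _end, score) in enumerate(exons):
        b, p = score, -1
        for j in range(i):
            if exons[j][1] < start and best[j] + score > b:
                b, p = best[j] + score, j
        best.append(b)
        prev.append(p)
    i = max(range(len(exons)), key=best.__getitem__)
    total, chain = best[i], []
    while i != -1:
        chain.append(exons[i])
        i = prev[i]
    return total, chain[::-1]


total, chain = best_chain([(1, 100, 5.0), (80, 200, 7.0),
                           (150, 300, 4.0), (320, 400, 3.0)])
```

In this toy input the highest-scoring single candidate (80, 200) is left out, because it overlaps two candidates whose combined score is higher.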

Gene prediction by phylogenetic footprinting

[Figure: bar chart of sensitivity and specificity (0% to 100%) for AGenDA and GenScan]

Gene prediction by phylogenetic footprinting

[Figure: comparison of AGenDA and GenScan; values shown: 64 %, 12 %, 17 %]

AUGUSTUS+

Extended GHMM using extrinsic information

Additional input data: collection h of "hints" about the possible gene structure φ for sequence s

Consider s, φ and h as the result of a random process. Define the probability P(s,h,φ)

Find parse φ that maximizes P(φ|s,h) for given s and h.

AUGUSTUS+

Hints created using

Alignments to EST sequences

Alignments to protein sequences

Combined EST and protein alignment (EST alignments supported by protein alignments)

Alignments of genomic sequences

User-defined hints

AUGUSTUS+

Alignment to EST: hint to (partial) exon

[Figure: EST aligned to genomic sequence G1]

AUGUSTUS+

EST alignment supported by protein: hint to exon (part), start codon

[Figure: EST and protein aligned to genomic sequence G1]

AUGUSTUS+

Alignment to ESTs, Proteins: hints to introns, exons

[Figure: ESTs and protein aligned to genomic sequence G1]

AUGUSTUS+

Alignment of genomic sequences: hint to (partial) exon

[Figure: genomic sequence G2 aligned to genomic sequence G1]

AUGUSTUS+

Consider different types of hints:

Types of hints: start, stop, dss (donor splice site), ass (acceptor splice site), exonpart, exon, intron

Each hint is associated with a position i in s (exons etc. are associated with their right end position)

At most one hint of each type is allowed per position in s

Each hint is associated with a grade g that indicates its source or reliability

AUGUSTUS+

h_{i,t} = information about hint of type t at position i

h_{i,t} = $ if no hint of type t is available at i

h_{i,t} = [grade, strand, (length, reading frame)] if a hint is available

(hints created by protein alignments or DIALIGN contain information about reading frame)

AUGUSTUS+

Standard program version, without hints

A T A A T G C C T A G T C   s (sequence)
Z Z Z E E E E E E I I I I   φ (parse)

Find parse that maximizes P(φ|s)

AUGUSTUS+

AUGUSTUS+ using hints

A T A A T G C C T A G T C   s (sequence)
$ $ $ $ $ $ $ X $ $ $ $ $   h (type 1)
$ $ $ $ $ $ $ $ $ $ $ $ $   h (type 2)
$ $ $ $ X $ $ $ $ $ $ $ $   h (type 3)
. . . .
Z Z Z E E E E E E I I I I   φ (parse)

Find parse that maximizes P(φ|s,h)

AUGUSTUS+

As in standard HMM theory: maximize joint probability P(φ,s,h)

How to define P(φ,s,h) ?

AUGUSTUS+

General assumption: Hints of different types t and at different positions i are independent of each other (for redundant hints: ignore "weaker" types).

\[
P(\varphi, s, h) = P(\varphi, s)\, P(h \mid \varphi, s) = P(\varphi, s) \prod_{i,t} P(h_{i,t} \mid \varphi, s)
\]

AUGUSTUS+

Assumption: P(h_{i,t} | φ,s) depends on the type t, the grade g and on whether h_{i,t} is compatible with φ or s.

Example: h_{i,t} is a hint to an exon E

h_{i,t} is compatible with the parse φ if E is part of φ.

h_{i,t} is compatible with the sequence s if start and stop codons exist according to E and if E contains no internal stop codon

AUGUSTUS+

For given g and t: 3 possible values for P(h_{i,t} | φ,s)

P(h_{i,t} | φ,s) = q+(t,g) if h_{i,t} is compatible with φ

P(h_{i,t} | φ,s) = q-(t,g) if h_{i,t} is compatible with s but not compatible with φ

P(h_{i,t} | φ,s) = 0 if h_{i,t} is not compatible with s

Values learned from training data
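The three cases translate directly into a small function, and under the independence assumption the hints contribute a sum of log probabilities to the score of a parse. A minimal sketch; the function names and the flat tuple layout are assumptions, and the q values passed in are placeholders rather than trained parameters.

```python
import math
from typing import Iterable, Tuple


def hint_prob(compatible_with_parse: bool, compatible_with_seq: bool,
              q_plus: float, q_minus: float) -> float:
    """P(h_{i,t} | phi, s) for one hint with known q+(t,g) and q-(t,g)."""
    if not compatible_with_seq:
        return 0.0
    return q_plus if compatible_with_parse else q_minus


# One entry per hint: (compatible with phi, compatible with s, q+, q-)
Hint = Tuple[bool, bool, float, float]


def log_hints_term(hints: Iterable[Hint]) -> float:
    """log of the product over all hints of P(h_{i,t} | phi, s)."""
    total = 0.0
    for in_parse, in_seq, q_plus, q_minus in hints:
        p = hint_prob(in_parse, in_seq, q_plus, q_minus)
        if p == 0.0:
            return -math.inf   # a hint that contradicts s rules the case out
        total += math.log(p)
    return total
```

Adding `log_hints_term` to log P(φ,s) gives log P(φ,s,h), the quantity that the extended model maximizes over φ.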

AUGUSTUS+

Results:

Gene (sub-)structures supported by hints receive bonus compared to non-supported structures

Gene (sub-)structures not supported by hints receive a penalty

(M. Stanke et al. 2006, BMC Bioinformatics)

AUGUSTUS+

h, h' collections of hints;

h'_{i,t} = h_{i,t} for (i,t) ≠ (I,T)

h'_{I,T} ≠ h_{I,T} = $; g grade of h'_{I,T}

φ+, φ- gene structures on s

h'_{I,T} compatible with φ+, but not with φ-

AUGUSTUS+

\[
\begin{aligned}
\frac{P(\varphi^+ \mid s, h')}{P(\varphi^- \mid s, h')}
&= \frac{P(\varphi^+, s, h')}{P(\varphi^-, s, h')}
 = \frac{P(\varphi^+, s)\, P(h' \mid \varphi^+, s)}{P(\varphi^-, s)\, P(h' \mid \varphi^-, s)} \\[4pt]
&= \frac{P(\varphi^+, s) \prod_{i,t} P(h'_{i,t} \mid \varphi^+, s)}{P(\varphi^-, s) \prod_{i,t} P(h'_{i,t} \mid \varphi^-, s)} \\[4pt]
&= \frac{P(\varphi^+, s) \prod_{i,t} P(h_{i,t} \mid \varphi^+, s) \cdot \dfrac{P(h'_{I,T} \mid \varphi^+, s)}{P(h_{I,T} \mid \varphi^+, s)}}
        {P(\varphi^-, s) \prod_{i,t} P(h_{i,t} \mid \varphi^-, s) \cdot \dfrac{P(h'_{I,T} \mid \varphi^-, s)}{P(h_{I,T} \mid \varphi^-, s)}} \\[4pt]
&= \frac{P(\varphi^+, s)\, P(h \mid \varphi^+, s) \cdot \dfrac{q^+(T,g)}{P(h_{I,T} = \$ \mid \varphi^+, s)}}
        {P(\varphi^-, s)\, P(h \mid \varphi^-, s) \cdot \dfrac{q^-(T,g)}{P(h_{I,T} = \$ \mid \varphi^-, s)}} \\[4pt]
&= \frac{P(\varphi^+ \mid s, h)}{P(\varphi^- \mid s, h)} \cdot
   \frac{q^+(T,g)\, P(h_{I,T} = \$ \mid \varphi^-, s)}{q^-(T,g)\, P(h_{I,T} = \$ \mid \varphi^+, s)}
\end{aligned}
\]

AUGUSTUS+

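Result:

\[
\frac{P(\varphi^+ \mid s, h')}{P(\varphi^- \mid s, h')}
= \frac{P(\varphi^+ \mid s, h)}{P(\varphi^- \mid s, h)} \cdot
  \frac{q^+(T,g)\, P(h_{I,T} = \$ \mid \varphi^-, s)}{q^-(T,g)\, P(h_{I,T} = \$ \mid \varphi^+, s)}
\]

i.e. the structure φ+, which is compatible with the additional hint h'_{I,T}, receives the relative bonus

\[
\frac{q^+(T,g)\, P(h_{I,T} = \$ \mid \varphi^-, s)}{q^-(T,g)\, P(h_{I,T} = \$ \mid \varphi^+, s)}
\]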
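A small numerical illustration of the bonus: the factor is q+(T,g) times the probability of seeing no hint under φ-, divided by q-(T,g) times the probability of seeing no hint under φ+. All numbers below are made up for illustration; the real values are learned from training data.

```python
def relative_bonus(q_plus: float, q_minus: float,
                   p_nohint_plus: float, p_nohint_minus: float) -> float:
    """Factor by which one extra hint changes the odds of phi+ against phi-.

    p_nohint_plus  : P(h_{I,T} = $ | phi+, s)
    p_nohint_minus : P(h_{I,T} = $ | phi-, s)
    """
    return (q_plus * p_nohint_minus) / (q_minus * p_nohint_plus)


def updated_odds(odds_without_hint: float, bonus: float) -> float:
    """Posterior odds P(phi+ | s, h') / P(phi- | s, h') after adding the hint."""
    return odds_without_hint * bonus


# Hypothetical values: a supporting hint is 20 times more likely than a
# contradicting one, and "no hint" is almost equally likely in both cases.
bonus = relative_bonus(q_plus=0.02, q_minus=0.001,
                       p_nohint_plus=0.98, p_nohint_minus=0.999)
odds = updated_odds(0.5, bonus)
```

With these assumed values, a structure that was half as probable as its competitor without the hint becomes roughly ten times more probable with it.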

AUGUSTUS+

Results (gene level) on data set sag178

                 % SN   % SP
Augustus         42     38
GenScan          18     14
GeneID           17     17
HMMGene          20     7
Aug. + EST       49     46
Aug. + prot      71     68
Aug. combined    68     65
Aug. all         82     79
GenomeScan       37     38
TwinScan         20     25

AUGUSTUS+

Using hints from DIALIGN alignments:

1. Obtain large human/mouse sequence pairs (up to 50 kb) from UCSC

2. Run CHAOS to find anchor points

3. Run DIALIGN using CHAOS anchor points

4. Create hints h from DIALIGN fragments

5. Run AUGUSTUS with hints

AUGUSTUS+

Hints from DIALIGN fragments:

Segment covered by peptide fragment minus 33 bp at both ends defines exon part hint on all 6 reading frames.
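The trimming rule can be sketched as a small helper. It assumes inclusive genomic coordinates and returns nothing for fragments that are too short to leave any hint after trimming; the function name and the handling of short fragments are assumptions, not taken from AUGUSTUS.

```python
from typing import Optional, Tuple


def exonpart_hint(frag_start: int, frag_end: int,
                  trim: int = 33) -> Optional[Tuple[int, int]]:
    """Interval of the exon-part hint derived from one peptide fragment.

    The fragment covers frag_start..frag_end (inclusive); `trim` bases are
    removed at both ends, as on the slide.
    """
    start, end = frag_start + trim, frag_end - trim
    return (start, end) if start <= end else None


hint = exonpart_hint(1000, 1200)     # long fragment: a hint remains
no_hint = exonpart_hint(100, 150)    # 51 bp fragment: nothing left after trimming
```

Trimming keeps the hint away from the fragment ends, where the alignment may extend past the true exon boundary.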

AUGUSTUS+

Hints from DIALIGN fragments:

Consider fragments with score ≥ 20

Distinguish high scores (≥ 45) from low scores

Consider reading frame given by DIALIGN

Consider strand given by DIALIGN

=> 2*2*2 = 8 grades

AUGUSTUS+

AUGUSTUS was the best ab initio method at EGASP

EGASP test results

[Figure: four bar charts comparing sensitivity and specificity of AUGUSTUS, GENSCAN, geneid, GeneMark.hmm and Genezilla. Panels: nucleotide level (axis 0 to 100), exon level (axis 0 to 100), transcript level (axis 0 to 30), gene level (axis 0 to 30).]

EGASP test results

[Figure: accuracy (0% to 100%) as sensitivity (Sn) and specificity (Sp) at base, exon, transcript and gene level for AUGUSTUS, AUGUSTUS+DIALIGN, DOGFISH-C, SGP2, TWINSCAN, TWINSCAN-MARS and N-SCAN]

Ongoing projects

Brugia malayi (TIGR)

Aedes aegypti (TIGR)

Schistosoma mansoni (TIGR)

Tetrahymena thermophila (TIGR)

Galdieria sulphuraria (Michigan State Univ.)

Coprinus cinereus (Univ. Göttingen)

Tribolium castaneum (Univ. Göttingen)