04/21/23 Burkhard Morgenstern, Tunis 2007
Multiple Alignment and Motif Searching
Burkhard Morgenstern
Universität Göttingen
Institute of Microbiology and Genetics
Department of Bioinformatics
Tunis, March 2007
http://www.gobics.de/burkhard/teaching/tunis_07.php
Information flow in the cell
Idea:
Sequence -> Structure -> Function
Gap between sequence and structure/function data:
Lots of data available at the sequence level
Fewer data at the structure and function level
Exponential growth of databases
Major goal of bioinformatics: close the gap between sequence information and structure/function information
Most important tool for sequence analysis: sequence comparison
Simple approach: dot plot, more advanced approach: sequence alignment
The dot plot
Gibbs and McIntyre (1970)
[Figure sequence: two sequences to be compared, written along the axes of a comparison matrix; axis labels as extracted: Y Q E W T Y I V A R E A Q Y E C I V M R E Q Y]
Search pairs of identical residues
Dot plot: dot (X) for all pairs of identical residues
Homologies as diagonal lines from top-left to bottom-right corner
Inversions as diagonals from bottom left to top right
Repeats as parallel diagonals
Advantages:
1. Various types of similarity detectable (repeats, inversions)
2. Useful for large-scale analysis
Use filtering for long sequences: dots represent matching segments instead of matching single residues
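The dot-plot idea can be sketched in a few lines of Python. This is only an illustration: the function names `dot_plot`, `filtered_dot_plot` and `render` are made up for this sketch, and the two sequences are the pair used in the pair-wise alignment slides. The filtered variant marks a dot only where two segments of a chosen window length are identical, which is one simple way to read the filtering remark above.

```python
def dot_plot(seq_x, seq_y):
    """Comparison matrix: entry [j][i] is True if seq_x[i] == seq_y[j]."""
    return [[a == b for a in seq_x] for b in seq_y]


def filtered_dot_plot(seq_x, seq_y, window):
    """Dot at [j][i] only if the segments of length `window` starting at
    position i in seq_x and position j in seq_y are identical."""
    return [
        [seq_x[i:i + window] == seq_y[j:j + window]
         for i in range(len(seq_x) - window + 1)]
        for j in range(len(seq_y) - window + 1)
    ]


def render(seq_x, seq_y):
    """Text rendering with X for a dot and . for an empty cell."""
    lines = ["  " + " ".join(seq_x)]
    for residue, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        lines.append(residue + " " + " ".join("X" if d else "." for d in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("TYIVAREQYE", "CIVMREAQY"))
```

Diagonal runs of X in the printed matrix correspond to the homologous segments; raising `window` removes isolated dots at the price of missing short matches.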
Pair-wise sequence alignment
Evolutionarily or structurally related sequences:
alignment possible
Sequence homologies represented by inserting gaps
[Figure sequence: comparison matrix and dot plot for the two input sequences T Y I V A R E Q Y E and C I V M R E A Q Y]
Similarities in same relative order over entire sequences
Global alignment of sequences possible
Alignment corresponds to path through comparison matrix
Matches (red), mismatches (green), gaps (blue)
(global) alignment: write sequences on top of each other, gaps represented by dash symbols
T Y I V A R E Q Y E
C I V M R E A Q Y
Input sequences
T Y I V A R E - Q Y E
C - I V M R E A Q Y -
Alignment of input sequences
Alignment consists of matches (red), mismatches (green) and gaps (blue)
Basic task:
Find ‘best’ alignment of two sequences
= alignment that reflects structural and evolutionary relations
Questions:
1. What is a good alignment?
2. How to find the best alignment?
Idea: consider alignment as hypothesis about evolution of sequences.
Gaps correspond to insertions/deletions
Mismatches correspond to substitutions
Problem: astronomical number of possible alignments
T Y I V A R E - Q Y E
C - I V M R E A - Q Y

T Y I V A R E Q Y E
C I - V M R E A Q Y

T Y I V A R E - Q Y E
- C I V M R E A Q Y -
Stupid computer has to find out: which alignment is best?
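To give a feeling for "astronomical", the number of alignments can be counted as the number of paths through the comparison matrix that use diagonal, horizontal and vertical steps. The sketch below does exactly that; the function name `count_alignments` is an assumption of this example, and alignments that differ only in the order of adjacent gaps are counted as different paths.

```python
def count_alignments(n, m):
    """Number of paths from the top-left to the bottom-right corner of an
    (n+1) x (m+1) comparison matrix, using diagonal steps (residue aligned
    with residue) and horizontal/vertical steps (residue aligned with gap)."""
    # paths[i][j]: number of ways to align the first i and the first j residues
    paths = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            paths[i][j] = (paths[i - 1][j - 1]
                           + paths[i - 1][j]
                           + paths[i][j - 1])
    return paths[n][m]


if __name__ == "__main__":
    for length in (3, 10, 50, 100):
        print(length, count_alignments(length, length))
```

Even for two short sequences the count grows far beyond what can be enumerated, which is why a systematic optimization method is needed.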
T Y I V A R E - Q Y E
- C I V M R E A Q Y -
First (simplified) rules:
1. minimize number of mismatches
2. maximize number of matches
General assumption: sequences not too distantly related.
In this case: mismatches (substitutions) and gaps (insertions/deletions) unlikely
Consequence: good alignment should reduce gaps and mismatches
T Y I V A R E - Q Y E
C I - V M R E A Q Y -
Second (simplified) rule:
minimize number of gaps
T Y I V - A R E - Q Y E
C I - V M - R E A Q Y -
Parsimony principle: minimize number of evolutionary events
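The simplified rules can be checked by counting matches, mismatches and gap positions for the candidate alignments shown on these slides. The helper below is a minimal sketch; its name `alignment_statistics` and the use of "-" as gap symbol are choices made for this example.

```python
def alignment_statistics(row1, row2):
    """Return (matches, mismatches, gap positions) of a pair-wise alignment
    given as two rows of equal length with '-' as gap symbol."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gaps += 1          # residue aligned with a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


if __name__ == "__main__":
    candidates = [
        ("TYIVARE-QYE", "-CIVMREAQY-"),
        ("TYIVARE-QYE", "CI-VMREAQY-"),
        ("TYIV-ARE-QYE", "CI-VM-REAQY-"),
    ]
    for top, bottom in candidates:
        print(top, bottom, alignment_statistics(top, bottom))
```

Under the rules above, the first candidate is preferable: it has the most matches and no more gap positions than the others.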
For protein sequences: different degrees of similarity among amino acids. Counting matches/mismatches is oversimplistic.
T Y I V
T L V
Protein sequences to be aligned
T Y I V
T L - V
Possible alignment
T Y I V
T - L V
Alternative alignment
Some amino acid residues are more similar to each other than others
Therefore: similarity among amino acid residues has to be taken into account.
To assess quality of protein alignments:
use similarity scores for amino acids
s(a,b) similarity score for amino acids a and b
Similarity measured by substitution matrices based on substitution probabilities
Important substitution matrices:
PAM (M. Dayhoff)
BLOSUM (S. Henikoff / J. Henikoff)
The PAM matrix:
Consider probability $p_{a,b}$ of substitution a → b (or b → a) for amino acids a and b
Define for amino acids a and b similarity score s(a,b) based on probability $p_{a,b}$
First task: find out $p_{a,b}$ for every pair of amino acids a, b
Use closely related protein families: no alignment problem, no double substitutions
Construct phylogenetic tree with parsimony method
Count substitution frequencies/probabilities
Normalize substitution probabilities
Extrapolate probabilities for larger evolutionary distances
Finally: define similarity score
$$s(a,b) = \log \frac{p_{a,b}}{q_a \, q_b}$$
$q_a$ = (relative) frequency of amino acid a
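The log-odds definition of the similarity score can be written as a one-line function. The probabilities used in the example call are invented for illustration only; they are not values from the PAM or BLOSUM matrices, and the function name `log_odds_score` is an assumption of this sketch.

```python
import math


def log_odds_score(p_ab, q_a, q_b):
    """Similarity score log(p_ab / (q_a * q_b)).

    p_ab: probability of seeing a and b aligned in related sequences
    q_a, q_b: relative (background) frequencies of a and b
    """
    return math.log(p_ab / (q_a * q_b))


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the sign of the score.
    print(log_odds_score(0.02, 0.1, 0.1))   # pair seen more often than by chance
    print(log_odds_score(0.005, 0.1, 0.1))  # pair seen less often than by chance
```

The score is positive when a pair occurs more often in related sequences than expected from the background frequencies, and negative when it occurs less often.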
T Y I V
T - L V
Given a similarity score s(a,b) for pairs of amino acids, define quality score of alignment as:
sum of similarity values s(a,b) of aligned residues
minus gap penalty g for each residue aligned with a gap
Example:
Score = s(T,T) + s(I,L) + s(V,V) - g
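The scoring rule can be sketched as follows. The similarity function `toy_similarity` is a made-up stand-in for a real substitution matrix (it treats I, L, V and M as similar to each other), and the gap penalty g = 2 is likewise an arbitrary choice for this example.

```python
def toy_similarity(a, b):
    """Hypothetical similarity score, NOT taken from PAM or BLOSUM."""
    if a == b:
        return 2
    if a in "ILVM" and b in "ILVM":   # assumed group of similar residues
        return 1
    return -1


def alignment_score(row1, row2, s, g):
    """Sum of s(a, b) over aligned residue pairs, minus g for every
    residue aligned with a gap ('-')."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score -= g
        else:
            score += s(a, b)
    return score


if __name__ == "__main__":
    print(alignment_score("TYIV", "T-LV", toy_similarity, 2))  # I aligned with L
    print(alignment_score("TYIV", "TL-V", toy_similarity, 2))  # Y aligned with L
```

With these toy values the alignment that pairs I with L scores higher than the one that pairs Y with L, which is the effect the similarity scores are meant to capture.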
Next question: find alignment with best score
Dynamic-programming algorithm finds alignment with best score.
(Needleman and Wunsch, 1970)
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
Alignment corresponds to path through comparison matrix
[Figure: path through the comparison matrix for this alignment]
T W L V - R E A Q I
- C I V M R E - H Y
Score of alignment: sum of similarity values of aligned residues minus gap penalty
Example: S = - g + s(W,C) + s(L,I) + s(V,V) - g + s(R,R) …
Dynamic programming: calculate scores S(i,j) of optimal alignment of prefixes up to positions i and j.
S(i,j) can be calculated from possible predecessors S(i-1,j-1), S(i,j-1), S(i-1,j).
Score of optimal path that comes from top left = S(i-1,j-1) + s(R,R)
T W L V - R
- C I V M R
Score of optimal path that comes from above = S(i,j-1) - g
T W L V R -
- C I V M R
Score of optimal path that comes from left = S(i-1,j) - g
T W L - - V R
- C I V M R -
Score of optimal path = maximum of these three values
Recursion formula for global alignment:
For sequences x and y
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + s(x_i, y_j) \\ S(i-1,j) - g \\ S(i,j-1) - g \end{cases}$$
[Figure sequence: comparison matrix for T W L V R (horizontal) and C I V M R E H Y (vertical), filled cell by cell]
Fill matrix from top left to bottom right
Find optimal alignment by trace-back procedure
Initial matrix entries?
Entries S(i,j): scores of optimal alignment of prefixes up to positions i and j.
T W L V
- C I V
Entries S(i,0): scores of optimal alignment of prefix up to position i and empty prefix.
T W L V
- - - -
Score = - i * g
Initial matrix entries: Example, g = 2
         T    W    L    V    R
     0  -2   -4   -6   -8  -10
C   -2
I   -4
V   -6
M   -8
R  -10
E  -12
H  -14
Y  -16
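The steps described on these slides (initial entries, filling the matrix with the three-way maximum, trace-back) can be put together in a compact sketch. The scoring function `simple_score` (+2 for identical residues, -1 otherwise) and the gap penalty g = 2 are assumptions made for this example; a real program would use a substitution matrix such as PAM or BLOSUM.

```python
def simple_score(a, b):
    """Hypothetical similarity score used only for this sketch."""
    return 2 if a == b else -1


def needleman_wunsch(x, y, s, g):
    """Global alignment with linear gap penalty g.
    Returns (optimal score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    # S[i][j]: score of the best alignment of x[:i] with y[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * g                      # prefix of x against empty prefix
    for j in range(1, m + 1):
        S[0][j] = -j * g                      # empty prefix against prefix of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          S[i - 1][j] - g,
                          S[i][j - 1] - g)
    # Trace-back from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - g:
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, row1, row2 = needleman_wunsch("TWLVR", "CIVMREHY", simple_score, 2)
    print(score)
    print(row1)
    print(row2)
```

When several alignments share the optimal score, this trace-back reports only one of them; which one depends on the order in which the three predecessors are tested.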
Computational complexity: how do program run time and memory depend on size of input data?
l1 and l2 length of sequences: computing time and memory proportional to l1 * l2
Time and memory complexity = O(l1 * l2)
More realistic gap penalty: affine-linear instead of linear
Penalty for gap of length l:
c0 + (l-1) * c1
c0 = ‘gap-opening penalty’
c1 = ‘gap-extension penalty’
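A small sketch comparing the two gap-penalty schemes, reading c0 as the gap-opening penalty and c1 as the gap-extension penalty. The numeric values in the example are arbitrary and the function names are invented for this illustration.

```python
def linear_gap_penalty(length, g):
    """Linear scheme: every gap position costs g."""
    return length * g


def affine_gap_penalty(length, c0, c1):
    """Affine-linear scheme: c0 for opening the gap, c1 for each
    further position, i.e. c0 + (length - 1) * c1."""
    if length < 1:
        raise ValueError("a gap has at least length 1")
    return c0 + (length - 1) * c1


if __name__ == "__main__":
    # Arbitrary example values: opening is expensive, extension is cheap.
    for length in (1, 2, 5, 10):
        print(length,
              linear_gap_penalty(length, 10),
              affine_gap_penalty(length, 10, 1))
```

With a small extension penalty, one long gap becomes much cheaper than several short ones of the same total length, which matches the view that a single insertion or deletion event can span many residues.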
Pair-wise local alignment
So far: global alignment considered: sequences aligned over their entire length.
But: sequences often share only local sequence similarity (conserved genes or domains)
Most important application: database searching
Problem:
Find pair of segments with maximal alignment score (not necessarily part of optimal global alignment!)
Equivalent: find path starting and ending anywhere in the matrix.
Recursion formula for global alignment:
S(i,j) = max { S(i-1,j-1) + s(x_i,y_j) , S(i-1,j) - g , S(i,j-1) - g }
Recursion formula for local alignment:
S(i,j) = max { 0 , S(i-1,j-1) + s(x_i,y_j) , S(i-1,j) - g , S(i,j-1) - g }
Initial matrix entries = 0
        T   W   L   V   R
    0   0   0   0   0   0
C   0   0
I   0
V   0
M   0
R   0
E   0
H   0
Y   0
Example: s(C,T) = -2, so all three predecessors give a negative value and the entry for (T,C) is set to 0
Recursion formula for global alignment:
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + s(x_i, y_j) \\ S(i-1,j) - g \\ S(i,j-1) - g \end{cases}$$
Recursion formula for local alignment:
$$S(i,j) = \max \begin{cases} 0 \\ S(i-1,j-1) + s(x_i, y_j) \\ S(i-1,j) - g \\ S(i,j-1) - g \end{cases}$$
For trace-back:
Store positions i_max and j_max with S(i_max, j_max) maximal
Algorithm by Smith and Waterman (1981)
Implementation: e.g. BestFit in GCG package
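The local recursion differs from the global one only in the extra 0, the zero initial entries, and a trace-back that starts at the stored maximum and stops at the first 0. The sketch below follows that description; it is not the BestFit implementation, and the scoring function `simple_score` and the gap penalty of 2 are assumptions made for the example.

```python
def simple_score(a, b):
    """Hypothetical similarity score used only for this sketch."""
    return 2 if a == b else -1


def smith_waterman(x, y, s, g):
    """Local alignment with linear gap penalty g.
    Returns (best score, aligned segment of x, aligned segment of y)."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # all initial entries are 0
    best, i_max, j_max = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          S[i - 1][j] - g,
                          S[i][j - 1] - g)
            if S[i][j] > best:                  # remember position of maximum
                best, i_max, j_max = S[i][j], i, j
    # Trace-back from the maximum until an entry 0 is reached.
    top, bottom = [], []
    i, j = i_max, j_max
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] - g:
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("AAAERYENSAAA", "CCCERYENSCCC", simple_score, 2))
```

In the example call the two sequences share only the segment ERYENS, and the unrelated flanking residues are left out of the reported alignment.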
Complexity: l1 and l2 length of sequences: computing time and memory proportional to l1 * l2
Time and space complexity = O(l1 * l2)
Too slow for database searching! Therefore tools like BLAST necessary for database searching
The Basic Local Alignment Search Tool (BLAST)
New BLAST version (1997)
Two-hit strategy
Gapped BLAST
Position-Specific Iterative BLAST (PSI-BLAST)
PSI-BLAST:
1. search database with standard BLAST
2. take best hits and create multiple alignment
3. calculate profile from multiple alignment
4. search database again with profile as query
Profile for sequence family or motif:
table of amino acid/nucleotide frequencies at every position in alignment.
Profile: frequencies of nucleotides at every position (in percent, gap positions not counted).
seq1   A    T    T    G    -    A    T
seq2   C    T    T    G    T    A    G
seq3   A    -    -    G    T    A    T
seq4   A    T    G    G    T    G    T
seq5   A    C    T    G    T    A    C

A     80    0    0    0    0   80    0
T      0   75   75    0  100    0   60
C     20   25    0    0    0    0   20
G      0    0   25  100    0   20   20
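The profile table can be recomputed from the five aligned sequences. The sketch below assumes that frequencies are given in percent and that gap positions are left out of the count for a column, which is the reading that matches the numbers on the slide; the function name `column_profile` is invented for this example.

```python
def column_profile(rows, alphabet="ATCG"):
    """Percentage of each symbol at every alignment column.
    Gap symbols ('-') are ignored when counting a column."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all alignment rows must have equal length")
    profile = {symbol: [] for symbol in alphabet}
    for col in range(length):
        residues = [row[col] for row in rows if row[col] != "-"]
        for symbol in alphabet:
            if residues:
                share = 100 * residues.count(symbol) / len(residues)
            else:
                share = 0.0            # column consisting of gaps only
            profile[symbol].append(share)
    return profile


if __name__ == "__main__":
    alignment = ["ATTG-AT", "CTTGTAG", "A--GTAT", "ATGGTGT", "ACTGTAC"]
    for symbol, shares in column_profile(alignment).items():
        print(symbol, " ".join(f"{share:5.0f}" for share in shares))
```

A profile-based search would then score a candidate sequence against these column frequencies instead of against a single query sequence.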
Tools for multiple sequence alignment
s1 T Y I M R E A Q Y E S A Q
s2 T C I V M R E A Y E
s3 Y I M Q E V Q Q E R
s4 W R Y I A M R E Q Y E

s1 - T Y I - M R E A Q Y E S A Q
s2 - T C I V M R E A - Y E - - -
s3 - - Y I - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E - - -
General information in multiple alignment:
Functionally important regions more conserved than non-functional regions
Local sequence conservation indicates functionality!
For phylogeny reconstruction:
Estimate pairwise distances between sequences (distance-based methods for tree reconstruction)
Estimate evolutionary events (parsimony and maximum likelihood methods)
Astronomical number of possible alignments!
s1 - T Y I - M R E A Q Y E S A Q
s2 - T C I V M R E A - - - Y E -
s3 Y I - - - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E - - -
Computer has to decide: which one is best?
Questions in development of multiple-alignment programs (as in pairwise alignment):
(1) What is a good alignment? → objective function (‘score’)
(2) How to find a good alignment? → optimization algorithm
First question far more important!
Traditional objective functions:
Define score of alignments as
Sum of individual similarity scores s(a,b)
Gap penalties
Needleman-Wunsch scoring system (1970)
Can be generalized to multiple alignment (e.g. sum-of-pairs score, tree alignment)
Needleman-Wunsch algorithm can also be generalized to multiple alignment, but:
Very time and memory consuming!
-> Heuristics needed
Multiple sequence alignment
First question: how to score multiple alignments?
Possible scoring scheme:
Sum-of-pairs score
Multiple alignment implies pairwise alignments:
Use sum of scores of these pairwise alignments
1aboA 36 WCEAQt..kngqGWVPSNYITPVN......
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......
1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd.....
1vie 28 YAVESeahpgsvQIYPVAALERIN......
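A sum-of-pairs score can be sketched by scoring every pairwise alignment implied by the multiple alignment and adding the results. The conventions used here are assumptions of this example: "-" is the gap symbol, columns in which both sequences of a pair have a gap are skipped, and `simple_score` with a gap penalty of 2 stands in for a real scoring scheme.

```python
from itertools import combinations


def simple_score(a, b):
    """Hypothetical similarity score used only for this sketch."""
    return 2 if a == b else -1


def induced_pair_score(row1, row2, s, g):
    """Score of the pairwise alignment implied by two rows of a
    multiple alignment; columns with a gap in both rows are dropped."""
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            score -= g
        else:
            score += s(a, b)
    return score


def sum_of_pairs_score(rows, s, g):
    """Sum of the scores of all implied pairwise alignments."""
    return sum(induced_pair_score(r1, r2, s, g)
               for r1, r2 in combinations(rows, 2))


if __name__ == "__main__":
    print(sum_of_pairs_score(["E-YENS", "ERYENS", "ERYA-S"], simple_score, 2))
```

For n sequences this adds up n(n-1)/2 pairwise scores, so evaluating a given alignment is cheap; the expensive part is searching for the alignment that maximizes the score.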
Needleman-Wunsch scoring scheme can be generalized from pair-wise to multiple alignment
Complexity:
For three sequences of lengths l1, l2, l3:
O( l1 * l2 * l3 )
For n sequences (average length l):
O( l^n )
Exponential complexity!
Optimal solution not feasible:
-> Heuristics necessary
`Progressive´ Alignment
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP
Guide tree
Idea: align closely related sequences first!
Profile alignment, “once a gap - always a gap”:
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASFQPVAALERIN
WLNYNEERGDFPGTYVEYIGRKKISP

WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN------
WLN-YNEERGDFPGTYVEYIGRKKISP

WCEAQTKNGQGWVPSNYITPVN-
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN------
WLN-YNEERGDFPGTYVEYIGRKKISP

WCEAQTKNGQGWVPSNYITPVN--------
WW--RLNDKEGYVPRNLLGLYP--------
AVVIQDNSDIKVVP--KAKIIRD-------
YAVESEA---SVQ--PVAALERIN------
WLN-YNE---ERGDFPGTYVEYIGRKKISP
“Greedy” algorithm:
Consider partial solution of bigger problem
Search best partial solution, fix solution
Search second-best partial solution that is consistent with first solution, fix solution
Search third-best partial solution … etc.
E.g.: Rucksack-Problem
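The greedy principle mentioned for progressive alignment can be illustrated with the rucksack (knapsack) problem. The sketch below ranks items by value per weight, which is one common greedy rule chosen here as an assumption; the item data are invented for the example. It shows that fixing the best partial solution first does not always give the best overall solution.

```python
def greedy_knapsack(items, capacity):
    """items: list of (name, weight, value).
    Repeatedly take the best remaining item (highest value per weight)
    that still fits, and never revise an earlier decision."""
    chosen = []
    remaining = capacity
    ranked = sorted(items, key=lambda item: item[2] / item[1], reverse=True)
    for name, weight, value in ranked:
        if weight <= remaining:        # consistent with what is already fixed
            chosen.append(name)
            remaining -= weight
    return chosen


if __name__ == "__main__":
    # Hypothetical items: (name, weight, value)
    items = [("a", 6, 9), ("b", 5, 6), ("c", 5, 6)]
    print(greedy_knapsack(items, 10))  # greedy keeps only "a" (value 9)
    # Packing "b" and "c" instead would give value 12.
```

Progressive alignment has the same property: gaps introduced when the first sequences are aligned stay fixed, even if a later sequence would suggest a different placement.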
Most important software program:
CLUSTAL W: J. Thompson, T. Gibson, D. Higgins (1994), CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment … Nuc. Acids. Res. 22, 4673-4680
(~ 18,000 citations in the literature)
Tools for multiple sequence alignment
Problems with traditional approach:
Results depend on gap penalty
Heuristic guide tree determines alignment; alignment used for phylogeny reconstruction
Algorithm produces global alignments.
But:
Many sequence families share only local similarity
E.g. sequences share one conserved motif
Local sequence alignment
Find common motif in sequences; ignore the rest
EYENS
ERYENS
ERYAS

E-YENS
ERYENS
ERYA-S
Important methods for local multiple alignment:
• PIMA
• MEME/MAST
Idea: expectation maximization.
Traditional alignment approaches:
Either global or local methods!
New question: sequence families with multiple local similarities
Neither local nor global methods applicable
Alignment possible if order conserved
The DIALIGN approach
Morgenstern, Dress, Werner (1996), PNAS 93, 12098-12103
Combination of global and local methods
Assemble multiple alignment from gap-free local pair-wise alignments (“fragments”)
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac----------gg-ttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac----------gg-ttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
Consistency!
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------TAATAGTTAaactccccCGTGC-TTag
cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg
caaa--GAGTATCAcc----------CCTGaaTTGAATaa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
Score of an alignment:
Define score of fragment f:
l(f) = length of f
s(f) = sum of matches (similarity values)
P(f) = probability of finding a fragment with length l(f) and at least s(f) matches in random sequences that have the same length as the input sequences.
Score w(f) = -ln P(f)
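A minimal Python sketch of this weight score. The slide does not say how P(f) is computed, so the sketch assumes DNA sequences with a match probability of 0.25 per position and approximates P(f) as the number of possible start-position pairs (len1 × len2) times a binomial tail probability. The function name and its parameters are illustrative only.

```python
import math

def fragment_weight(length, matches, len1, len2, p_match=0.25):
    """Weight w(f) = -ln P(f) of a gap-free fragment (illustrative approximation)."""
    # Probability that one fixed fragment of this length has at least `matches` matches.
    tail = sum(
        math.comb(length, k) * p_match**k * (1 - p_match) ** (length - k)
        for k in range(matches, length + 1)
    )
    # Assumption: scale by the number of start-position pairs and cap at 1.
    p = min(1.0, len1 * len2 * tail)
    return -math.log(p)
```

Under these assumptions a fragment that is likely to occur by chance gets a weight close to 0, while a long fragment with many matches gets a large weight.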
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
Score of an alignment:
Define score of fragment f:
Define score of alignment as
sum of scores of involved fragments
No gap penalty!
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
Score of an alignment:
Goal in fragment-based alignment approach: find
Consistent collection of fragments with maximum sum of weight scores
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaaccccctcgtgcttagagatccaaaccagtgcgtgtattactaacggttcaatcgcgcacatccgc
Pair-wise alignment:
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaaccccctcgtgcttagagatccaaaccagtgcgtgtattactaacggttcaatcgcgcacatccgc
Pair-wise alignment:
recursive algorithm finds optimal chain of
fragments.
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
------atctaatagttaaaccccctcgtgcttag-------agatccaaaccagtgcgtgtattactaac----------ggttcaatcgcgcacatccgc--
Pair-wise alignment:
recursive algorithm finds optimal chain of
fragments.
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
------atctaatagttaaaccccctcgtgcttag-------agatccaaaccagtgcgtgtattactaac----------ggttcaatcgcgcacatccgc--
Optimal pairwise alignment: chain of fragments with maximum sum of weights found by dynamic programming:
Standard fragment-chaining algorithm
Space-efficient algorithm
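The chaining step can be sketched in Python as follows. This is a simple quadratic-time dynamic program over a given list of weighted fragments, not the space-efficient algorithm mentioned on the slide. The tuple layout (start1, start2, length, weight) and the function name are assumptions made for illustration.

```python
def best_chain(fragments):
    """fragments: list of (start1, start2, length, weight).
    Returns (total weight, chain) for the best chain of non-overlapping,
    order-preserving fragments."""
    # Sort by end position in sequence 1 so every possible predecessor comes first.
    frags = sorted(fragments, key=lambda f: (f[0] + f[2], f[1] + f[2]))
    best = []  # best[k] = (score of best chain ending in frags[k], previous index)
    for k, (s1, s2, n, w) in enumerate(frags):
        score, prev = w, None
        for j in range(k):
            t1, t2, m, _ = frags[j]
            # frags[j] may precede frags[k] only if it ends before it in both sequences.
            if t1 + m <= s1 and t2 + m <= s2 and best[j][0] + w > score:
                score, prev = best[j][0] + w, j
        best.append((score, prev))
    if not best:
        return 0.0, []
    k = max(range(len(best)), key=lambda i: best[i][0])
    total, chain = best[k][0], []
    while k is not None:  # follow the back-pointers
        chain.append(frags[k])
        k = best[k][1]
    return total, chain[::-1]
```

Because no gap penalty is used, the score of a chain is just the sum of its fragment weights.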
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
Multiple alignment:
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
Multiple alignment:
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaccctgaattgaagagtatcacataa
(1) Calculate all optimal pair-wise alignments
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
Multiple alignment:
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
(1) Calculate all optimal pair-wise alignments
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
Fragments from optimal pair-wise alignments might be inconsistent
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
Fragments from optimal pair-wise alignments might be inconsistent
1. Sort fragments according to scores
2. Include them one-by-one into growing multiple alignment – as long as they are consistent
(greedy algorithm, comparable to knapsack problem)
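The two steps above can be sketched in Python. The consistency test here is a naive one chosen for illustration: aligned sites are merged into classes, and the set is consistent if the classes can still be put in a linear order compatible with all sequences (no cycle). It is re-run from scratch for every candidate fragment, which is much slower than the bounds-based method described on the later slides. All names, the 0-based positions and the fragment layout (weight, site1, site2, length) are assumptions.

```python
def is_consistent(seq_lengths, aligned_pairs):
    """Sites are (sequence index, position). True if the pairs fit one alignment."""
    parent = {}

    def find(x):  # union-find over sites
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in aligned_pairs:
        parent[find(a)] = find(b)
    succ, indeg = {}, {}
    for i, n in enumerate(seq_lengths):
        for p in range(n):
            u = find((i, p))
            succ.setdefault(u, set())
            indeg.setdefault(u, 0)
        for p in range(n - 1):  # class of position p must come before class of p+1
            u, v = find((i, p)), find((i, p + 1))
            if v not in succ[u]:
                succ[u].add(v)
                indeg[v] += 1
    # Kahn's algorithm: every class can be ordered iff there is no cycle.
    queue = [u for u, d in indeg.items() if d == 0]
    ordered = 0
    while queue:
        u = queue.pop()
        ordered += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return ordered == len(indeg)


def greedy_align(seq_lengths, fragments):
    """fragments: list of (weight, (i, p), (j, q), length). Returns accepted fragments."""
    accepted, pairs = [], []
    for frag in sorted(fragments, key=lambda f: -f[0]):  # 1. sort by score
        _, (i, p), (j, q), n = frag
        new_pairs = [((i, p + k), (j, q + k)) for k in range(n)]
        if is_consistent(seq_lengths, pairs + new_pairs):  # 2. keep if consistent
            accepted.append(frag)
            pairs += new_pairs
    return accepted
```

In this sketch a fragment that crosses a higher-scoring fragment is simply skipped; it is never reconsidered, which is what makes the procedure greedy.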
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
Consistency problem
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
Upper and lower bounds for alignable positions
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa
Upper and lower bounds for alignable positions
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagt taaactcccccgtgcttag
cagtgcgtgtattact aacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa
Upper and lower bounds for alignable positions
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taata-----gttaaactcccccgtgcttag
cagtgcgtgtatta-----ctaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa
Upper and lower bounds for alignable positions
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
Upper and lower bounds for alignable positions
site x = [i,p] (sequence i, position p)
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
Upper and lower bounds for alignable positions
Calculate lower bound bl(x,i) and upper bound bu(x,i) for each x and sequence i
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
Upper and lower bounds for alignable positions
bl(x,i) and bu(x,i) updated for each new fragment in alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
Consistency bounds have to be updated for each new fragment that is included in the growing alignment
Efficient algorithm
(Abdeddaim and Morgenstern, 2002)
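To make the bounds concrete, here is a naive Python sketch that recomputes them by graph search. It is not the efficient algorithm of Abdeddaim and Morgenstern. It assumes 0-based positions, a consistent set of aligned site pairs, and a site x that is not in sequence i; the function names are illustrative.

```python
from collections import deque

def reachable(start, edges):
    """All sites reachable from `start` along `edges` (breadth-first search)."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in edges.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bounds(x, i, seq_lengths, aligned_pairs):
    """Smallest and largest position in sequence i that site x can still be aligned with."""
    fwd, bwd = {}, {}

    def edge(u, v):
        fwd.setdefault(u, []).append(v)
        bwd.setdefault(v, []).append(u)

    for s, n in enumerate(seq_lengths):
        for p in range(n - 1):
            edge((s, p), (s, p + 1))  # order within each sequence
    for a, b in aligned_pairs:
        edge(a, b)  # aligned sites sit in the same column,
        edge(b, a)  # so they are linked in both directions
    after = {p for (s, p) in reachable(x, fwd) if s == i}
    before = {p for (s, p) in reachable(x, bwd) if s == i}
    both = after & before
    if both:  # x is already aligned with a position of sequence i
        p = min(both)
        return p, p
    lower = max(before) + 1 if before else 0
    upper = min(after) - 1 if after else seq_lengths[i] - 1
    return lower, upper
```

For example, with two sequences of length 6 and one aligned pair ((0, 2), (1, 3)), site (0, 0) can only be aligned with positions 0 to 2 of sequence 1, and site (0, 4) only with positions 4 to 5.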
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
Advantages of segment-based approach:
Program can produce global and local alignments!
Sequence families alignable that cannot be aligned with standard methods
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
DIALIGN is available
Online at BiBiServ (Bielefeld Bioinformatics Server)
Downloadable UNIX/LINUX executables at BiBiServ
Source code (email to BM)
04/21/23 Burkhard Morgenstern, Tunis 2007
Program input
Program usage:
> dialign2-2 [options] <input_file>
<input_file> = multi-sequence file in FASTA-format
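A small Python sketch of how such a multi-sequence FASTA input could be produced. It only builds the file content and does not call the DIALIGN program. The function name and the line width of 60 are assumptions.

```python
def to_fasta(records, width=60):
    """records: list of (name, sequence). Returns multi-sequence FASTA text."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")  # header line of each sequence
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

The returned text can be written to a file, which is then passed as <input_file>.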
04/21/23 Burkhard Morgenstern, Tunis 2007
Program output
DIALIGN 2.2.1
*************
Program code written by Burkhard Morgenstern and Said Abdeddaim
e-mail contact: [email protected]
Published research assisted by DIALIGN 2 should cite:
Burkhard Morgenstern (1999). DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, 211 - 218.
For more information, please visit the DIALIGN home page at
http://bibiserv.techfak.uni-bielefeld.de/dialign/
program call: ./dialign2-2 -nt -anc s
Aligned sequences:   length:
==================   =======
1) dog_il4           300
2) bla               200
3) blu               200
Average seq. length: 233.3
Please note that only upper-case letters are considered to be aligned.
04/21/23 Burkhard Morgenstern, Tunis 2007
Program output
Alignment (DIALIGN format):
===========================
dog_il4     1  cagg------ ----GTTTGA atctgataca ttgc------ ----------
bla         1  ctga------ ---------- ---------- --------GC CAAGTGGGAA
blu         1  ttttgatatg agaaGTGTGA aacaagctat cctatattGC TAAGTGGCAG
               0000000000 0000000000 0000000000 0000000011 1111111111

dog_il4    25  ---------- --ATGGCACT GGGGTGAATG AGGCAGGCAG CAGAATGATC
bla        17  ggtgtgaata catgggtttc cagtaccttc tgaggtccag agtacc----
blu        51  ccctggcttt ctATGTGCAC AGAATGGGAG GAAAGTGCCT GCTAGTGAGC
               0000000000 0000000000 0000000000 0000000000 0000000000

dog_il4    63  GTACTGCAGC CCTGAGCTTC CACTGGCCCA TGTTGGTATC CTTGTATTTT
bla        63  ---------- ---------- ---TTTCCCA TGTGCTCCAT GGTGGAATGG
blu       101  CAGGGACTCA GAGAGAATGG AGTATAGGGG TCAGGGCat- ----------
               0000000000 0000000000 0009999999 9999999888 8888888888

dog_il4   113  TCCGCCCCTT CCCAGCACca gcattatcct ---GGGATTG GAGAAGGGGG
bla        90  ACCACTCCTT CTCAGCACaa caaagcccaa gaaGGTGTTG CGTTCTAGAC
blu       140  ---------- ---------- ---------- ---GGGGTGG CCTTAGGCTC
               8888888888 8888888800 0000000000 0007777777 7777777777
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaac----------ggttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac----------gg-ttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------TAATAGTTAaactccccCGTGC-TTag------
cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg
caaa--GAGTATCAcc----------CCTGaaTTGAATaa--
04/21/23 Burkhard Morgenstern, Tunis 2007
The DIALIGN approach
atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac----------gg-ttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
T-COFFEE
C. Notredame, D. Higgins, J. Heringa (2000), T-Coffee: A novel algorithm for multiple sequence alignment, J. Mol. Biol.
04/21/23 Burkhard Morgenstern, Tunis 2007
T-COFFEE
Problem with “progressive” approaches:
Strictly global alignments
Use only pair-wise comparison
04/21/23 Burkhard Morgenstern, Tunis 2007
T-COFFEE
Idea: Start with local and global pair-wise alignments (“primary library” of alignments)
Construct “secondary library” of residues that are indirectly aligned by primary library.
Re-score residue pairs
Construct final alignment with “progressive” method
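The re-scoring step can be sketched in Python. The slide does not give the formula. This sketch assumes the common “triplet” form of library extension: if residue x is aligned with z and z with y in the primary library, the pair (x, y) gains the smaller of the two primary weights. The residue encoding (sequence name, position) and the function name are illustrative.

```python
from collections import defaultdict

def extend_library(primary):
    """primary: dict mapping frozenset({residue1, residue2}) -> weight,
    where a residue is (sequence name, position). Returns re-scored pairs."""
    neighbours = defaultdict(dict)
    for pair, w in primary.items():
        a, b = tuple(pair)
        neighbours[a][b] = w
        neighbours[b][a] = w
    extended = defaultdict(float, primary)  # start from the primary weights
    for z, aligned in neighbours.items():
        partners = list(aligned.items())
        for idx, (x, wx) in enumerate(partners):
            for y, wy in partners[idx + 1:]:
                if x[0] != y[0]:  # only pairs from different sequences
                    # x and y are indirectly aligned through z
                    extended[frozenset((x, y))] += min(wx, wy)
    return dict(extended)
```

Residue pairs that are supported by many intermediate sequences end up with higher weights than pairs supported by a single pairwise alignment.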
04/21/23 Burkhard Morgenstern, Tunis 2007
T-COFFEE
Advantage:
Combination of local and global approaches
Less sensitive to mis-alignments in progressive procedure
04/21/23 Burkhard Morgenstern, Tunis 2007
T-COFFEE
04/21/23 Burkhard Morgenstern, Tunis 2007
T-COFFEE
T-COFFEE and DIALIGN:
Less sensitive to spurious pairwise similarities
Can handle local homologies better than CLUSTAL
04/21/23 Burkhard Morgenstern, Tunis 2007
Most multi-alignment approaches automated, i.e. based on algorithmic rules. Two components:
Objective function: assess alignment quality
Optimization algorithm: find optimal or near-optimal alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
Fully automated alignment programs necessary
if no expert knowledge available
if large amounts of data to be analyzed
But:
Often no biologically reasonable results
Often additional information about homologies etc. available
04/21/23 Burkhard Morgenstern, Tunis 2007
Idea for improved alignment
Use expert knowledge to influence alignment procedure
DIALIGN with user-defined anchor points
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of large genomic sequences
Alignment of large genomic sequences to identify functional elements (phylogenetic footprinting)
Göttgens et al., 2000, 2001, 2002, … Pollard et al., 2004
DIALIGN, MGA, PipMaker, LAGAN, AVID, Mummer, WABA, …
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of large genomic sequences
Gene-regulatory sites identified by multiple sequence alignment (phylogenetic footprinting)
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of large genomic sequences
04/21/23 Burkhard Morgenstern, Tunis 2007
DIALIGN alignment of human and murine genomic sequences
04/21/23 Burkhard Morgenstern, Tunis 2007
DIALIGN alignment of tomato and Arabidopsis thaliana genomic sequences
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of large genomic sequences
DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of large genomic sequences
DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)
Alignment of Hox gene cluster:
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of large genomic sequences
DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)
Alignment of Hox gene cluster:
DIALIGN able to identify small regulatory elements, but
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of large genomic sequences
DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)
Alignment of Hox gene cluster:
DIALIGN able to identify small regulatory elements, but
Entire genes totally mis-aligned
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of large genomic sequences
DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)
Alignment of Hox gene cluster:
DIALIGN able to identify small regulatory elements, but
Entire genes totally mis-aligned
Reason for mis-alignment: duplications!
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of large genomic sequences
The Hox gene cluster:
4 Hox gene clusters in pufferfish. 14 genes, different genes in different clusters!
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of large genomic sequences
The Hox gene cluster:
Complete mis-alignment of entire genes!
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of sequence duplications
S1
S2
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of sequence duplications
S1
S2
Conserved motifs; no similarity outside motifs
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of sequence duplications
S1
S2
Duplication in two sequences
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of sequence duplications
S1
S2
Mis-alignment would have lower score!
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of sequence duplications
S1
S2
Duplication in one sequence
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of sequence duplications
S1
S2
Duplication in one sequence
Possible mis-alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of sequence duplications
S1
S2
Duplication in one sequence
S3
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of sequence duplications
S1
S2
Consistency problem
S3
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of sequence duplications
S1
S2
More plausible alignment – and higher score:
S3
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of sequence duplications
S1
S2
Consistency problem
S3
04/21/23 Burkhard Morgenstern, Tunis 2007
Alignment of sequence duplications
S1
S2
Alternative alignment; probably biologically wrong; lower numerical score!
S3
04/21/23 Burkhard Morgenstern, Tunis 2007
Anchored sequence alignment
Biologically meaningful alignment not possible by automated approaches.
Idea: use expert knowledge to guide alignment procedure
User defines a set of anchor points that are to be “respected” by the alignment procedure
04/21/23 Burkhard Morgenstern, Tunis 2007
Anchored sequence alignment
NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN
IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT
04/21/23 Burkhard Morgenstern, Tunis 2007
Anchored sequence alignment
NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN
IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT
Use known homology as anchor point
04/21/23 Burkhard Morgenstern, Tunis 2007
Anchored sequence alignment
NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN
IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT
Use known homology as anchor point
Anchor point = anchored fragment (gap-free pair of segments)
Remainder of sequences aligned automatically
04/21/23 Burkhard Morgenstern, Tunis 2007
Anchored sequence alignment
NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN
IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT
Alignment of anchored positions a and b not enforced – a and b may be un-aligned –, but:
a is only residue that can be aligned to b
Residues left of a aligned with residues left of b
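These rules can be written as a small Python check for a single anchored pair of positions. The function name and the position-based encoding are assumptions made for illustration.

```python
def respects_anchor(p, q, a, b):
    """True if aligning position p (first sequence) with position q (second sequence)
    is compatible with the anchored pair (a, b)."""
    if p == a or q == b:
        # a is the only residue that may be aligned to b, and vice versa
        return p == a and q == b
    # otherwise both positions must lie on the same side of the anchor
    return (p < a) == (q < b)
```

The anchored pair itself passes the check, but nothing forces it into the alignment; the rule only excludes pairs that would cross or compete with the anchor.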
04/21/23 Burkhard Morgenstern, Tunis 2007
Anchored sequence alignment
-------NLF VALYDFVASG DNTLSITKGE klrvlgynhn
iihredkGVI YALWDYEPQN DDELPMKEGD cmt-------
Anchored alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
Anchored sequence alignment
NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN
IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT
GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS
Anchor points in multiple alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
Anchored sequence alignment
NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN
IIHREDKGVIYALWDYEPQND DELPMKEGDCMT
GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS
Anchor points in multiple alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
Anchored sequence alignment
-------NLF V-ALYDFVAS GD-------- NTLSITKGEk lrvLGYNhn
iihredkGVI Y-ALWDYEPQ ND-------- DELPMKEGDC MT-------
-------GYQ YrALYDYKKE REedidlhlg DILTVNKGSL VA-LGFS--
Anchored multiple alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
Goal:
Find optimal alignment (= consistent set of fragments) under constraints given by user-specified anchor points!
04/21/23 Burkhard Morgenstern, Tunis 2007
Additional input file with anchor points:
1 3 215 231 5 4.5
2 3 34 78 23 1.23
1 4 317 402 8 8.5
Algorithmic questions
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN
IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT
GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS
04/21/23 Burkhard Morgenstern, Tunis 2007
Additional input file with anchor points:
1 3 215 231 5 4.5
2 3 34 78 23 1.23
1 4 317 402 8 8.5
Sequences
Algorithmic questions
04/21/23 Burkhard Morgenstern, Tunis 2007
Additional input file with anchor points:
1 3 215 231 5 4.5
2 3 34 78 23 1.23
1 4 317 402 8 8.5
Sequences start positions
Algorithmic questions
04/21/23 Burkhard Morgenstern, Tunis 2007
Additional input file with anchor points:
1 3 215 231 5 4.5
2 3 34 78 23 1.23
1 4 317 402 8 8.5
Sequences start positions length
Algorithmic questions
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
Additional input file with anchor points:
Sequences   start positions   length   score
1  3        215  231          5        4.5
2  3        34   78           23       1.23
1  4        317  402          8        8.5
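A short Python sketch for reading such an anchor file, following the column meaning given on the slide (two sequence numbers, two start positions, length, score). The class and function names are illustrative, and lines that do not have exactly six fields are skipped as an assumption.

```python
from dataclasses import dataclass

@dataclass
class Anchor:
    seq1: int
    seq2: int
    start1: int
    start2: int
    length: int
    score: float

def parse_anchors(text):
    """Parse anchor lines of the form 'seq1 seq2 start1 start2 length score'."""
    anchors = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:  # assumption: ignore anything that is not an anchor line
            continue
        *ints, score = fields
        anchors.append(Anchor(*map(int, ints), float(score)))
    return anchors
```

Sorting the result by score, highest first, gives the order in which the anchors would be considered.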
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
Requirements:
Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
Inconsistent anchor points!
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
atctaat---agttaaactcccccgtgcttag
cagtgcgtgtattac-taacggttcaatcgcg
caaagagtatcacccctgaattgaataa
Inconsistent anchor points!
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
Requirements:
Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points
Find alignment under constraints given by anchor points!
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
Use data structures from multiple alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
Greedy procedure for multiple alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
Question: which positions are still alignable ?
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
atctaatagttaaactcccccgtgcttag Si
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa x
For each position x and each sequence Si there exist an
upper bound ub(x,i) and a lower bound lb(x,i) for
residues y in Si that are alignable with x
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
atctaatagttaaactcccccgtgcttag Si
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa x
ub(x,i) and lb(x,i) updated during greedy procedure
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
atctaatagttaaactcccccgtgcttag Si
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa x
Initial values of lb(x,i), ub(x,i)
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
atctaatagttaaactcccccgtgcttag Si
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa x
ub(x,i) and lb(x,i) updated during greedy procedure
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
Anchor points treated like fragments in greedy algorithm:
Sorted according to user-defined scores
Accepted if consistent with previously accepted anchors
ub(x,i) and lb(x,i) updated during greedy procedure
Resulting values of ub(x,i) and lb(x,i) used as initial values for alignment procedure
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
atctaatagttaaactcccccgtgcttag Si
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa x
Initial values of lb(x,i), ub(x,i)
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
atctaatagttaaactcccccgtgcttag Si
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa x
Initial values of lb(x,i), ub(x,i) calculated using anchor
points
04/21/23 Burkhard Morgenstern, Tunis 2007
Algorithmic questions
Ranking of anchor points to prioritize anchor points, e.g.
anchor points from verified homologies -- higher priority
automatically created anchor points (using CHAOS, BLAST, … ) -- lower priority
04/21/23 Burkhard Morgenstern, Tunis 2007
Application: Hox gene cluster
04/21/23 Burkhard Morgenstern, Tunis 2007
Application: Hox gene cluster
Use gene boundaries as anchor points
04/21/23 Burkhard Morgenstern, Tunis 2007
Application: Hox gene cluster
Use gene boundaries as anchor points
+ CHAOS / BLAST hits
04/21/23 Burkhard Morgenstern, Tunis 2007
Application: Hox gene cluster
               no anchoring   anchoring
Ali. columns
  2 seq        2958           3674
  3 seq        668            1091
  4 seq        244            195
Score          1166           1007
CPU time       4:22           0:19
04/21/23 Burkhard Morgenstern, Tunis 2007
Application: Hox gene cluster
Example:
Teleost Hox gene cluster:
Score of anchored alignment 15 % higher than score of non-anchored alignment !
Conclusion: Greedy optimization algorithm does a bad job!
04/21/23 Burkhard Morgenstern, Tunis 2007
Application: Improvement of Alignment programs
Two possible reasons for mis-alignments:
Wrong objective function: Biologically correct alignment gets bad numerical score
Bad optimization algorithms: Biologically correct alignment gets best numerical score, but algorithm fails to find this alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
Application: Improvement of Alignment programs
Two possible reasons for mis-alignments:
Anchored alignments can help to decide
04/21/23 Burkhard Morgenstern, Tunis 2007
Application: RNA alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
Application: RNA alignment
aa----CCCC AGC---GUAa gucgcuaucc a
cacucuCCCA AGC---GGAG Aac------- -
ccg----CCA AaagauGGCG Acuuga---- -
non-anchored alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
Application: RNA alignment
aa----CCCC AGC---GUAa gucgcuaucc a
cacucuCCCA AGC---GGAG Aac------- -
ccg----CCA AaagauGGCG Acuuga---- -
structural motif mis-aligned
04/21/23 Burkhard Morgenstern, Tunis 2007
Application: RNA alignment
aaCCCCAGCG UAAGUCGCUA UCca--
--CACUCUCC CAAGCGGAGA AC----
----CCGCCA AAAGAUGGCG ACuuga
3 conserved nucleotides as anchor points
04/21/23 Burkhard Morgenstern, Tunis 2007
WWW interface at GOBICS (Göttingen Bioinformatics Compute Server)
04/21/23 Burkhard Morgenstern, Tunis 2007
Gene predictions for eukaryotes
04/21/23 Burkhard Morgenstern, Tunis 2007
Gene predictions for eukaryotes
Goal: find location and structure of protein-coding genes in eukaryotic genome sequences.
04/21/23 Burkhard Morgenstern, Tunis 2007
Gene predictions for eukaryotes
attgccagtacgtagctagctacacgtatgctattacggatctgtagcttagcgtatctgtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcttagtcgtgtagtcttgatctacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgtacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatggtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctag
04/21/23 Burkhard Morgenstern, Tunis 2007
Gene predictions for eukaryotes
Three different approaches to computational gene-finding:
Intrinsic: use statistical information about known genes (Hidden Markov Models)
Extrinsic: compare genomic sequence with known proteins / genes
Cross-species sequence comparison: search for similarities among genomes
04/21/23 Burkhard Morgenstern, Tunis 2007
Hidden-Markov-Models (HMM) for gene prediction
Generative probabilistic model for a sequence of observations ("symbols").
Finite set of states
States can emit symbols
Transitions between states possible
Sequence generated by path between states
04/21/23 Burkhard Morgenstern, Tunis 2007
Hidden-Markov-Models (HMM) for gene prediction
Example: The occasionally dishonest casino.
3 5 6 6 6 4 6 5 1 6 5 1 2
F F U U U U U F F F F F F
Possible states:
fair (F); unfair (U); begin (B); end (E)
04/21/23 Burkhard Morgenstern, Tunis 2007
Hidden-Markov-Models (HMM) for gene prediction
Assumptions:
Emission probabilities known; depend only on current state.
Transition probabilities known; depend only on current state.
04/21/23 Burkhard Morgenstern, Tunis 2007
Hidden-Markov-Models (HMM) for gene prediction
[State diagram: begin (B), fair (F), unfair (U) and end (E) states with transitions between them]
04/21/23 Burkhard Morgenstern, Tunis 2007
Hidden-Markov-Models (HMM) for gene prediction
3 5 6 6 6 4 6 5 1 6 5 1 2 s
B F F U U U U U F F F F F F E φ
For sequence s and parse φ:
P(φ): probability of φ
P(φ,s): joint probability of φ and s, P(φ,s) = P(φ) * P(s|φ)
P(φ|s): a-posteriori probability of φ
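Because emissions and transitions depend only on the current state, the joint probability factorizes along the path. A sketch of that factorization, where the notation is an assumption and not taken from the slides: a_{k,l} is the transition probability from state k to state l, e_k(x) is the probability that state k emits symbol x, and n is the length of s.

\[
P(\varphi, s) \;=\; a_{B,\varphi_1} \prod_{k=1}^{n} e_{\varphi_k}(s_k)\, a_{\varphi_k,\varphi_{k+1}}, \qquad \varphi_{n+1} = E
\]

\[
P(\varphi \mid s) \;=\; \frac{P(\varphi, s)}{P(s)}
\]

Since P(s) does not depend on φ, the parse that maximizes P(φ,s) also maximizes P(φ|s).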
04/21/23 Burkhard Morgenstern, Tunis 2007
Hidden-Markov-Models (HMM) for gene prediction
3 5 6 6 6 4 6 5 1 6 5 1 2
B F F U U U U U F F F F F F E
Goal: find path φ with maximum a-posteriori probability P(φ|s)
Idea: find path that maximizes joint probability P(φ,s) by dynamic programming (Viterbi algorithm)
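A minimal sketch of the Viterbi recursion for the casino example. The slides say the probabilities are known but do not list them, so all numbers below are hypothetical; the end state E is left out for brevity.

```python
import math

# Hypothetical parameters (not from the slides).
STATES = ("F", "U")
START = {"F": 0.5, "U": 0.5}                      # B -> F, B -> U
TRANS = {"F": {"F": 0.95, "U": 0.05},
         "U": {"F": 0.10, "U": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "U": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}


def viterbi(rolls):
    """Return (path, log P(path, rolls)) for the most probable state path."""
    # best[k][q]: log-probability of the best path for rolls[:k+1] ending in q
    best = [{q: math.log(START[q]) + math.log(EMIT[q][rolls[0]]) for q in STATES}]
    back = []
    for x in rolls[1:]:
        scores, pointers = {}, {}
        for q in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + math.log(TRANS[p][q]))
            scores[q] = best[-1][prev] + math.log(TRANS[prev][q]) + math.log(EMIT[q][x])
            pointers[q] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(STATES, key=lambda q: best[-1][q])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), best[-1][last]


def log_joint(path, rolls):
    """log P(path, rolls) under the same model, for scoring a given parse."""
    logp = math.log(START[path[0]]) + math.log(EMIT[path[0]][rolls[0]])
    for prev, q, x in zip(path, path[1:], rolls[1:]):
        logp += math.log(TRANS[prev][q]) + math.log(EMIT[q][x])
    return logp


rolls = [3, 5, 6, 6, 6, 4, 6, 5, 1, 6, 5, 1, 2]
path, logp = viterbi(rolls)
print(path, logp)
```

Working in log space avoids numerical underflow on long sequences; `log_joint` can be used to compare any hand-written parse against the Viterbi path.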
04/21/23 Burkhard Morgenstern, Tunis 2007
Hidden-Markov-Models (HMM) for gene prediction
Application to gene prediction:
A T A A T G C C T A G T C   s (DNA)
Z Z Z E E E E E E I I I I   φ (parse)
Introns, exons etc. modeled as states in a GHMM ("generalized HMM")
Given sequence s, find parse that maximizes P(φ|s)
(S. Karlin and C. Burge, 1997)
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS
Basic model for GHMM-based intrinsic gene finding comparable to GenScan (M. Stanke)
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS
Features of AUGUSTUS:
Intron length model
Initial pattern for exons
Similarity-based weighting for splice sites
Interpolated HMM
Internal 3’ content model
04/21/23 Burkhard Morgenstern, Tunis 2007
Hidden-Markov-Models (HMM) for gene prediction
A T A A T G C C T A G T C   s (DNA)
Z Z Z E E E E I I I I   φ (parse)
Explicit intron length model computationally expensive.
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS
Intron length model:
• Explicit length distribution for short introns
• Geometric tail for long introns
[State diagram: exon states connected through intron states Intron (fixed), Intron (expl.) and Intron (geo.)]
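A minimal sketch of such a mixed length distribution. The geometric tail corresponds to a state with a self-transition, which is cheap to evaluate, while the explicit part is a lookup table. The function name, the parametrization and the toy numbers are assumptions for illustration, not the AUGUSTUS implementation.

```python
def intron_length_prob(d, explicit, tail_mass, p_stay):
    """P(intron length = d) for a mixed explicit/geometric length model.

    explicit  : list where explicit[k] is P(length = k + 1) for short introns
    tail_mass : total probability of all lengths beyond the explicit range
    p_stay    : self-transition probability of the geometric intron state
    """
    d_max = len(explicit)
    if d < 1:
        return 0.0
    if d <= d_max:
        return explicit[d - 1]
    # Geometric tail: each additional base costs a factor p_stay.
    return tail_mass * (1.0 - p_stay) * p_stay ** (d - d_max - 1)


# Toy numbers, chosen only so that the distribution sums to 1.
explicit = [0.1, 0.3, 0.2]      # lengths 1..3, total 0.6
tail_mass = 0.4
p_stay = 0.5
total = sum(intron_length_prob(d, explicit, tail_mass, p_stay) for d in range(1, 200))
print(round(total, 6))
```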
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS
Extension of AUGUSTUS to include extrinsic information:
Protein sequences
EST sequences
Syntenic genomic sequences
User-defined constraints
04/21/23 Burkhard Morgenstern, Tunis 2007
Gene prediction by phylogenetic footprinting
Comparison of genomic sequences
(human and mouse)
04/21/23 Burkhard Morgenstern, Tunis 2007
catcatatcttatcttacgttaactcccccgt
cagtgcgtgatagcccatatccgg
Standard score: consider length and number of matches; compute probability of random occurrence
Gene prediction by phylogenetic footprinting
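A simplified sketch of such a score for a gap-free fragment: the probability that a random fragment of the same length has at least as many matches, turned into a weight by taking the negative logarithm. The independent-position model with match probability 0.25 and the function names are assumptions; the slides do not give the exact formula.

```python
import math


def match_tail_probability(length, matches, p_match=0.25):
    """P(at least `matches` matches in a random gap-free fragment of `length`)."""
    return sum(math.comb(length, k) * p_match**k * (1 - p_match)**(length - k)
               for k in range(matches, length + 1))


def fragment_weight(seq1, seq2, p_match=0.25):
    """-log of the tail probability for two equal-length, gap-free segments."""
    if len(seq1) != len(seq2):
        raise ValueError("fragment segments must have equal length")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return -math.log(match_tail_probability(len(seq1), matches, p_match))


print(fragment_weight("ttatcttacgtt", "atagcccatatc"))
```

Rarer fragments (longer, or with more matches) get a smaller probability and therefore a larger weight.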
04/21/23 Burkhard Morgenstern, Tunis 2007
Translation option:
L S Y V
catcatatc tta tct tac gtt aactcccccgt
cagtgcgtg ata gcc cat atc cgg
I A H I
DNA segments translated to peptide segments; fragment score based on peptide similarity:
Calculate probability of finding a fragment of the same length with (at least) the same sum of BLOSUM values
Gene prediction by phylogenetic footprinting
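A minimal sketch of the translation step, using the standard genetic code. The substitution matrix is not included: `peptide_score` takes a caller-supplied mapping from amino-acid pairs to values (for example BLOSUM values), and that interface is an assumption for illustration.

```python
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in t, c, a, g order.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna):
    """Translate a DNA segment codon by codon; trailing bases are ignored."""
    dna = dna.lower()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def peptide_score(pep1, pep2, matrix):
    """Sum of substitution values; `matrix` maps (aa1, aa2) pairs to scores."""
    return sum(matrix[a, b] for a, b in zip(pep1, pep2))


print(translate("ttatcttacgtt"))   # LSYV
print(translate("atagcccatatc"))   # IAHI
```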
04/21/23 Burkhard Morgenstern, Tunis 2007
P-fragment (in both orientations)
L S Y V
catcatatc tta tct tac gtt aactcccccgt
cagtgcgtg ata gcc cat atc cgg
I A H I
N-fragment
catcatatc ttatcttacgtt aactcccccgtgct
           || |  |  |
cagtgcgtg atagcccatatc cg
For each fragment f, three probability values are calculated; the score of f is based on the smallest P value.
P-fragments associated with strand and reading frame!
Gene prediction by phylogenetic footprinting
04/21/23 Burkhard Morgenstern, Tunis 2007
Gene prediction by phylogenetic footprinting
AGenDA: Alignment-based Gene Detection Algorithm
04/21/23 Burkhard Morgenstern, Tunis 2007
Gene prediction by phylogenetic footprinting
Fragments in DIALIGN alignment
04/21/23 Burkhard Morgenstern, Tunis 2007
Gene prediction by phylogenetic footprinting
Build cluster of fragments
04/21/23 Burkhard Morgenstern, Tunis 2007
Gene prediction by phylogenetic footprinting
Identify conserved splice sites
04/21/23 Burkhard Morgenstern, Tunis 2007
Gene prediction by phylogenetic footprinting
• Candidate exons bounded by conserved splice sites
• Find optimal chain of candidate exons
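A minimal sketch of the chaining step as a dynamic program over candidate exons sorted by end position. It only enforces that exons in a chain do not overlap; reading-frame consistency and the scoring itself are left out, and the (start, end, score) representation is an assumption for illustration.

```python
def best_exon_chain(candidates):
    """Return (total score, chain) for the best chain of non-overlapping exons.

    candidates: list of (start, end, score) with inclusive coordinates.
    """
    exons = sorted(candidates, key=lambda e: e[1])
    best = []      # best[i] = (score, predecessor index) for chains ending in exons[i]
    for i, (start, _end, score) in enumerate(exons):
        top, pred = score, None
        for j in range(i):
            if exons[j][1] < start and best[j][0] + score > top:
                top, pred = best[j][0] + score, j
        best.append((top, pred))
    if not best:
        return 0.0, []
    i = max(range(len(best)), key=lambda k: best[k][0])
    total, chain = best[i][0], []
    while i is not None:
        chain.append(exons[i])
        i = best[i][1]
    return total, chain[::-1]


candidates = [(1, 10, 5.0), (5, 20, 8.0), (12, 30, 4.0), (25, 40, 6.0)]
print(best_exon_chain(candidates))
```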
04/21/23 Burkhard Morgenstern, Tunis 2007
Gene prediction by phylogenetic footprinting
[Bar chart: sensitivity and specificity (0% to 100%) of AGenDA and GenScan]
04/21/23 Burkhard Morgenstern, Tunis 2007
Gene prediction by phylogenetic footprinting
[Figure: comparison of AGenDA and GenScan predictions; values shown: 64 %, 12 %, 17 %]
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Extended GHMM using extrinsic information
Additional input data: collection h of "hints" about possible gene structure φ for sequence s
Consider s, φ and h as the result of a random process. Define probability P(s,h,φ)
Find parse φ that maximizes P(φ|s,h) for given s and h.
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Hints created using
Alignments to EST sequences
Alignments to protein sequences
Combined EST and protein alignment (EST alignments supported by protein alignments)
Alignments of genomic sequences
User-defined hints
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Alignment to EST: hint to (partial) exon
[Figure: EST aligned to genomic sequence G1]
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
EST alignment supported by protein: hint to exon (part), start codon
[Figure: EST and protein aligned to genomic sequence G1]
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Alignment to ESTs, Proteins: hints to introns, exons
[Figure: ESTs and protein aligned to genomic sequence G1]
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Alignment of genomic sequences: hint to (partial) exon
[Figure: genomic sequence G2 aligned to genomic sequence G1]
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Consider different types of hints:
Types of hints: start, stop, dss, ass, exonpart, exon, introns
Hint associated with position i in s (exons etc. associated with right end position)
Max. one hint of each type allowed per position in s
Each hint associated with a grade g that indicates its source or reliability.
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
h_{i,t} = information about hint of type t at position i
h_{i,t} = $ if no hint of type t available at i
h_{i,t} = [grade, strand, (length, reading frame)] if hint available
(hints created by protein alignments or DIALIGN contain information about reading frame)
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Standard program version, without hints
A T A A T G C C T A G T C   s (sequence)
Z Z Z E E E E E E I I I I   φ (parse)
Find parse that maximizes P(φ|s)
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
AUGUSTUS+ using hints
A T A A T G C C T A G T C   s (sequence)
$ $ $ $ $ $ $ X $ $ $ $ $   h (type 1)
$ $ $ $ $ $ $ $ $ $ $ $ $   h (type 2)
$ $ $ $ X $ $ $ $ $ $ $ $   h (type 3)
. . . .
Z Z Z E E E E E E I I I I   φ (parse)
Find parse that maximizes P(φ|s,h)
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
As in standard HMM theory: maximize joint probability P(φ,s,h)
How to define P(φ,s,h) ?
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
General assumption: Hints of different types t and at different positions i are independent of each other (for redundant hints: ignore "weaker" types).
\[
P(\varphi, s, h) \;=\; P(\varphi, s)\, P(h \mid \varphi, s) \;=\; P(\varphi, s) \prod_{i,t} P(h_{i,t} \mid \varphi, s)
\]
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Assumption: P(h_{i,t}|φ,s) depends on type t, grade g and whether h_{i,t} is compatible with φ or s.
Example: h_{i,t} hint to exon E
h_{i,t} compatible with parse φ if E is part of φ.
h_{i,t} compatible with sequence s if start and stop codons exist according to E and no internal stop codon exists in E.
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
For given g and t: 3 possible values for P(h_{i,t}|φ,s)
P(h_{i,t}|φ,s) = q+(t,g) if h_{i,t} compatible with φ
P(h_{i,t}|φ,s) = q-(t,g) if h_{i,t} compatible with s but not compatible with φ
P(h_{i,t}|φ,s) = 0 if h_{i,t} not compatible with s
Values learned from training data
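A minimal sketch of how the three cases enter the hint term of the joint probability. The `Hint` class, the lookup tables and the two compatibility callbacks are hypothetical; the sketch only multiplies the factors of hints that are present and omits the factors for positions without a hint ($).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Hint:
    position: int
    hint_type: str      # e.g. "exonpart", "dss"
    grade: int


def log_hint_likelihood(hints, q_plus, q_minus,
                        compatible_with_parse, compatible_with_sequence):
    """log of the product of P(h_{i,t} | parse, s) over the given hints.

    q_plus, q_minus: dicts mapping (hint_type, grade) to q+(t,g) and q-(t,g).
    """
    total = 0.0
    for h in hints:
        if not compatible_with_sequence(h):
            return -math.inf                 # probability 0
        table = q_plus if compatible_with_parse(h) else q_minus
        total += math.log(table[h.hint_type, h.grade])
    return total


# Toy values, not learned from any training data.
q_plus = {("exonpart", 1): 0.8}
q_minus = {("exonpart", 1): 0.2}
hints = [Hint(10, "exonpart", 1), Hint(50, "exonpart", 1)]
print(log_hint_likelihood(hints, q_plus, q_minus,
                          lambda h: h.position == 10, lambda h: True))
```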
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Results:
Gene (sub-)structures supported by hints receive a bonus compared to non-supported structures
Gene (sub-)structures not supported by hints receive a penalty
(M. Stanke et al. 2006, BMC Bioinformatics)
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
h, h' collections of hints;
h'_{i,t} = h_{i,t} for (i,t) ≠ (I,T)
h'_{I,T} ≠ h_{I,T} = $; g grade of h'_{I,T}
φ+, φ- gene structures on s
h'_{I,T} compatible with φ+, but not with φ-
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
\[
\begin{aligned}
\frac{P(\varphi^+ \mid s,h')}{P(\varphi^- \mid s,h')}
&= \frac{P(\varphi^+,s,h')}{P(\varphi^-,s,h')}
 = \frac{P(\varphi^+,s)\,P(h' \mid \varphi^+,s)}{P(\varphi^-,s)\,P(h' \mid \varphi^-,s)} \\[4pt]
&= \frac{P(\varphi^+,s)\,\prod_{i,t} P(h'_{i,t} \mid \varphi^+,s)}{P(\varphi^-,s)\,\prod_{i,t} P(h'_{i,t} \mid \varphi^-,s)} \\[4pt]
&= \frac{P(\varphi^+,s)\,\prod_{i,t} P(h_{i,t} \mid \varphi^+,s)\cdot\dfrac{P(h'_{I,T} \mid \varphi^+,s)}{P(h_{I,T} \mid \varphi^+,s)}}{P(\varphi^-,s)\,\prod_{i,t} P(h_{i,t} \mid \varphi^-,s)\cdot\dfrac{P(h'_{I,T} \mid \varphi^-,s)}{P(h_{I,T} \mid \varphi^-,s)}}
\end{aligned}
\]
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
\[
\begin{aligned}
&\frac{P(\varphi^+,s)\,\prod_{i,t} P(h_{i,t} \mid \varphi^+,s)\cdot\dfrac{P(h'_{I,T} \mid \varphi^+,s)}{P(h_{I,T} \mid \varphi^+,s)}}{P(\varphi^-,s)\,\prod_{i,t} P(h_{i,t} \mid \varphi^-,s)\cdot\dfrac{P(h'_{I,T} \mid \varphi^-,s)}{P(h_{I,T} \mid \varphi^-,s)}} \\[4pt]
&\quad= \frac{P(\varphi^+,s)\,P(h \mid \varphi^+,s)\cdot\dfrac{q^+(T,g)}{P(h_{I,T}=\$ \mid \varphi^+,s)}}{P(\varphi^-,s)\,P(h \mid \varphi^-,s)\cdot\dfrac{q^-(T,g)}{P(h_{I,T}=\$ \mid \varphi^-,s)}} \\[4pt]
&\quad= \frac{P(\varphi^+ \mid s,h)}{P(\varphi^- \mid s,h)}\cdot\frac{q^+(T,g)\,P(h_{I,T}=\$ \mid \varphi^-,s)}{q^-(T,g)\,P(h_{I,T}=\$ \mid \varphi^+,s)}
\end{aligned}
\]
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
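Result:
\[
\frac{P(\varphi^+ \mid s,h')}{P(\varphi^- \mid s,h')}
= \frac{P(\varphi^+ \mid s,h)}{P(\varphi^- \mid s,h)}\cdot\frac{q^+(T,g)\,P(h_{I,T}=\$ \mid \varphi^-,s)}{q^-(T,g)\,P(h_{I,T}=\$ \mid \varphi^+,s)}
\]
i.e. structure φ+, which is compatible with the additional hint h'_{I,T}, receives the relative bonus
\[
\frac{q^+(T,g)\,P(h_{I,T}=\$ \mid \varphi^-,s)}{q^-(T,g)\,P(h_{I,T}=\$ \mid \varphi^+,s)}
\]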
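A small numeric sketch of this result: the odds of φ+ against φ- are multiplied by a factor built from q+(T,g), q-(T,g) and the two probabilities of seeing no hint ($) at (I,T). The function name and all numbers are hypothetical.

```python
def updated_odds(prior_odds, q_plus, q_minus, p_none_plus, p_none_minus):
    """Odds of phi+ over phi- after adding one hint compatible with phi+ only.

    prior_odds   : P(phi+|s,h) / P(phi-|s,h) before the extra hint
    q_plus       : q+(T,g)
    q_minus      : q-(T,g)
    p_none_plus  : P(h_{I,T} = $ | phi+, s)
    p_none_minus : P(h_{I,T} = $ | phi-, s)
    """
    bonus = (q_plus * p_none_minus) / (q_minus * p_none_plus)
    return prior_odds * bonus


# Toy values: phi+ starts out less probable than phi- (odds 0.5).
print(updated_odds(0.5, q_plus=0.6, q_minus=0.1, p_none_plus=0.4, p_none_minus=0.9))
```

With these toy values the factor is larger than 1, so the extra hint shifts the odds towards φ+.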
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Results (gene level) on data set sag178
Program          % SN   % SP
Augustus           42     38
GenScan            18     14
GeneID             17     17
HMMGene            20      7
Aug. + EST         49     46
Aug. + prot        71     68
Aug. combined      68     65
Aug. all           82     79
GenomeScan         37     38
TwinScan           20     25
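A minimal sketch of how gene-level sensitivity (SN) and specificity (SP) are commonly computed in gene-prediction evaluations: SN is the share of annotated genes that are predicted exactly, SP the share of predicted genes that are correct. The slides do not spell out the definitions, so this follows the usual convention; the representation of a gene as a tuple of exon coordinates is an assumption.

```python
def gene_level_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) in percent for sets of gene structures."""
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    sn = 100.0 * correct / len(annotated) if annotated else 0.0
    sp = 100.0 * correct / len(predicted) if predicted else 0.0
    return sn, sp


# Each gene is a tuple of (start, end) exon coordinates; toy data only.
annotated = {((100, 200), (300, 400)), ((900, 1000),), ((1500, 1600),), ((2000, 2100),)}
predicted = {((100, 200), (300, 400)), ((900, 1050),)}
print(gene_level_accuracy(predicted, annotated))
```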
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Using hints from DIALIGN alignments:
1. Obtain large human/mouse sequence pairs (up to 50 kb) from UCSC
2. Run CHAOS to find anchor points
3. Run DIALIGN using CHAOS anchor points
4. Create hints h from DIALIGN fragments
5. Run AUGUSTUS with hints
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Hints from DIALIGN fragments:
The segment covered by a peptide fragment, minus 33 bp at both ends, defines an exonpart hint on all 6 reading frames.
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
Hints from DIALIGN fragments:
Consider fragments with score ≥ 20
Distinguish high scores (≥ 45) from low scores
Consider reading frame given by DIALIGN
Consider strand given by DIALIGN
=> 2*2*2 = 8 grades
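A minimal sketch of turning one DIALIGN peptide fragment into an exonpart hint: discard fragments scoring below 20, trim 33 bp at both ends, and record whether the score is high (≥ 45). Inclusive genomic coordinates, the handling of fragments that are too short to trim, and all names are assumptions; reading frame and strand, which yield the remaining grade distinctions, are left out.

```python
from typing import NamedTuple, Optional


class ExonPartHint(NamedTuple):
    start: int
    end: int
    high_score: bool


def hint_from_fragment(start: int, end: int, score: float,
                       trim: int = 33, min_score: float = 20.0,
                       high_score: float = 45.0) -> Optional[ExonPartHint]:
    """Return an exonpart hint for a fragment covering [start, end], or None."""
    if score < min_score:
        return None
    h_start, h_end = start + trim, end - trim
    if h_start > h_end:
        return None    # nothing left after trimming (assumed behavior)
    return ExonPartHint(h_start, h_end, score >= high_score)


print(hint_from_fragment(100, 300, 50.0))
```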
04/21/23 Burkhard Morgenstern, Tunis 2007
AUGUSTUS+
AUGUSTUS best ab-initio method at EGASP
04/21/23 Burkhard Morgenstern, Tunis 2007
EGASP test results
[Bar chart, nucleotide level: sensitivity and specificity (scale 0 to 100) for AUGUSTUS, GENSCAN, geneid, GeneMark.hmm and Genezilla]
04/21/23 Burkhard Morgenstern, Tunis 2007
EGASP test results
[Bar chart, exon level: sensitivity and specificity (scale 0 to 100) for AUGUSTUS, GENSCAN, geneid, GeneMark.hmm and Genezilla]
04/21/23 Burkhard Morgenstern, Tunis 2007
EGASP test results
[Bar chart, transcript level: sensitivity and specificity (scale 0 to 30) for AUGUSTUS, GENSCAN, geneid, GeneMark.hmm and Genezilla]
04/21/23 Burkhard Morgenstern, Tunis 2007
EGASP test results
[Bar chart, gene level: sensitivity and specificity (scale 0 to 30) for AUGUSTUS, GENSCAN, geneid, GeneMark.hmm and Genezilla]
04/21/23 Burkhard Morgenstern, Tunis 2007
EGASP test results
[Bar chart: accuracy (Sn and Sp, 0% to 100%) at base, exon, transcript and gene level for AUGUSTUS, AUGUSTUS+DIALIGN, DOGFISH-C, SGP2, TWINSCAN, TWINSCAN-MARS and N-SCAN]
04/21/23 Burkhard Morgenstern, Tunis 2007
Ongoing projects
Brugia malayi (TIGR)
Aedes aegypti (TIGR)
Schistosoma mansoni (TIGR)
Tetrahymena thermophila (TIGR)
Galdieria sulphuraria (Michigan State Univ.)
Coprinus cinereus (Univ. Göttingen)
Tribolium castaneum (Univ. Göttingen)