Medical Natural Sciences Year 2: Introduction to Bioinformatics
Lecture 9:
Multiple sequence alignment (III)
Centre for Integrative Bioinformatics VU
Intermezzo: Symmetry-derived secondary structure prediction using
multiple sequence alignments (SymSSP)
Victor Simossis, Jaap Heringa
Centre for Integrative Bioinformatics VU (IBIVU)
Vrije Universiteit Amsterdam, The Netherlands
Symmetry-derived secondary structure prediction using multiple sequence alignments (SymSSP)
• Modern state-of-the-art methods use multiple sequence alignments
• Methods like PHD, Profs, SSPro, etc., predict for the top sequence in the alignment by cutting out positions with gaps in the top sequence
• What if two helices ‘out of phase’ are pasted together? Or a strand and a helix?
• Approach: correct by permuting alignments and consensus prediction
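The permutation step can be illustrated with a minimal sketch. It assumes an alignment is a list of equal-length strings with '-' for gaps; the function names `project_on_top` and `all_projections` are hypothetical and not taken from SymSSP. For each choice of top sequence, the columns where that sequence has a gap are cut out, which is the projection the bullet points describe.

```python
def project_on_top(alignment, top):
    """Put sequence `top` first and drop every column where it has a gap."""
    ref = alignment[top]
    keep = [i for i, ch in enumerate(ref) if ch != "-"]
    # New row order: the chosen sequence first, the others in original order.
    order = [top] + [k for k in range(len(alignment)) if k != top]
    return ["".join(alignment[k][i] for i in keep) for k in order]


def all_projections(alignment):
    """One projected alignment per sequence (orders 1234, 2134, 3124, ...)."""
    return [project_on_top(alignment, t) for t in range(len(alignment))]


if __name__ == "__main__":
    toy = ["AC-GT", "A-TGT", "ACTG-"]
    for rows in all_projections(toy):
        print(rows)
```

Each projected alignment would then be handed to a predictor of choice; the predictor itself is outside the scope of this sketch.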
Secondary structure periodicity patterns
Buried β-strand
Edge β-strand
α-helix
hydrophobic hydrophilic
Symmetry-derived Secondary structure prediction using MA (SymSSP)
[Figure: the alignment of sequences 1-4 is permuted so that each sequence is on top in turn (orders 1234, 2134, 3124, 4123). Each permutation yields a set of predicted E (strand) and H (helix) segments, with '?' at undetermined positions; the sets are combined into a consensus prediction.]
Optimal segmentation of predicted secondary structures
H score 0 0 0 0 0 ...
E score 3 4 4 4 3 ...
C score 1 0 0 0 0 ...
1234
EEEEE HHHHHH EEEEE HH
EEEE? ?HHHHH EEE H
EEEEE HHHHH? ??EE HH
EEEEEE ?HHHHH EEEE HH
1->1, 1->2, 1->3, 1->4
? score 0 0 0 0 1 ...
Region 0 1 1 1 0 ...
CEH
Each sequence within an alignment gives rise to a library of n secondary structure predictions, where n is the number of sequences in the alignment.
The predictions are recorded by secondary structure type and region position in a single matrix
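As a rough illustration of recording a prediction library in a single matrix, the sketch below simply counts, per position, how many predictions vote for H, E, C or '?'. The function name `score_matrix` and the plain vote count are assumptions; the weighting actually used by SymSSP is not specified here.

```python
def score_matrix(predictions):
    """Count H, E, C and '?' votes per position over a library of predictions.

    `predictions` is a list of equal-length strings over the alphabet H, E, C, ?.
    Returns a dict mapping each symbol to a list of per-position counts.
    """
    length = len(predictions[0])
    counts = {symbol: [0] * length for symbol in "HEC?"}
    for pred in predictions:
        if len(pred) != length:
            raise ValueError("all predictions must have the same length")
        for pos, symbol in enumerate(pred):
            counts[symbol][pos] += 1
    return counts


if __name__ == "__main__":
    # Toy library of four predictions for the same five positions.
    library = ["EEEEE", "EEEE?", "EEEEE", "CEEEE"]
    for symbol, row in score_matrix(library).items():
        print(symbol, row)
```

The toy input was chosen so that the E, C and '?' rows match the first five columns of the score rows shown on the earlier slide.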
Optimal segmentation of predicted secondary structures by Dynamic Programming
[Figure: DP matrix with axes sequence position and window size; each cell records Max score, Offset and Label, derived from the H score, E score and C score]
The recorded values are used in a weighted function according to their secondary structure type, that gives each position a window-specific score. The more probable the secondary structure element, the higher the score.
Restrictions:
H only if ws>=4
E only if ws>=2
5H
2 6
Segmentation score (Total score of each path)
? score, Region
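A minimal sketch of segmentation by dynamic programming under the window-size restrictions above. It assumes the score of a segment is the plain sum of the per-position scores for its label, which stands in for the unspecified weighted function; `segment` and `MIN_LEN` are hypothetical names.

```python
MIN_LEN = {"H": 4, "E": 2, "C": 1}  # H only if ws >= 4, E only if ws >= 2


def segment(scores, min_len=MIN_LEN):
    """Return (total score, label string) of the best-scoring segmentation.

    `scores` maps "H", "E" and "C" to per-position score lists of equal length.
    """
    n = len(scores["C"])
    # Prefix sums make the score of any window an O(1) lookup.
    prefix = {}
    for label in min_len:
        acc = [0]
        for value in scores[label]:
            acc.append(acc[-1] + value)
        prefix[label] = acc

    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[i]: best total score for positions 0..i-1
    back = [None] * (n + 1)      # back[i]: (window size, label) of the last segment
    best[0] = 0
    for end in range(1, n + 1):
        for label, shortest in min_len.items():
            for ws in range(shortest, end + 1):
                start = end - ws
                if best[start] == neg_inf:
                    continue
                cand = best[start] + prefix[label][end] - prefix[label][start]
                if cand > best[end]:
                    best[end], back[end] = cand, (ws, label)

    # Trace back from the end of the sequence to recover the labels.
    pieces, end = [], n
    while end > 0:
        ws, label = back[end]
        pieces.append(label * ws)
        end -= ws
    return best[n], "".join(reversed(pieces))


if __name__ == "__main__":
    toy = {
        "H": [3, 3, 3, 0, 0, 0, 0, 0],
        "E": [0, 0, 0, 0, 0, 4, 4, 0],
        "C": [1, 1, 1, 2, 2, 0, 0, 2],
    }
    print(segment(toy))
```

In the toy input the three strong helix positions cannot form a helix on their own, so the best path extends the helix to four positions rather than falling back to coil.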
Example of an optimally segmented secondary structure prediction library for sequence 3chy
3chy ---------------GYVV-----KPFTAATLEEKLNKIFEKLGM------
3chy <- 1fx1 ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESDE ??????????????? ee ?? hhhhhhhhhhhhhhh ????????
3chy <- FLAV_DESVH ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESGI ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- FLAV_DESSA ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- 4fxn ??????????????? eee ?? hhhhhhhhhhhhh ?????????
3chy <- FLAV_MEGEL ????????????????eee ?? hh?hhhhhhhhhhh ?????????
3chy <- 2fcr e ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ANASP ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ECOLI eeeeeee hhhhhhhhhhhhhhh hhhhh
3chy <- FLAV_AZOVI ? eeeeeee hhhhhhhhhhhhhhh ????
3chy <- FLAV_ENTAG e eeeeeeee hhhhhhhhhhhhhhhh? ??????
3chy <- FLAV_CLOAB eeeeeee hhhhhhhhhh ???????????
3chy <- 3chy --------------- ----- hhhhhhhhhhhhhh ------
Consensus ---------------EEEE----- HHHHHHHHHHHHH ------
Consensus-DSSP ...............****.....****xx***************......
PHD --------------- ----- HHHHHHHHHHHHHH ------
PHD-DSSP ...............xxxx.....******************x**......
DSSP ...............EEEE.....SS HHHHHHHHHHHHHHHT ......
LumpDSSP ...............EEEE..... HHHHHHHHHHHHHHH ......
Symmetry-derived secondary structure prediction (SymSSP)
• Tried over 120 different consensus weighting schemes (global, regional, positional)
• Tested on ~2700 HOMSTRAD alignments and compared to PHD: on average 0.5% better
• 60% of the alignments are improved, 20% are not affected and 20% are made worse
• Tried to correlate schemes with “cheap” a priori data (pairwise identities, sequence lengths, number of sequences, etc.)
Integrating secondary structure prediction and multiple sequence
alignment
• A low-key example was shown, with fairly homogeneous data (strings of letters in both cases)
• But it is already difficult to do, and the methods are not easily tunable
• How to scale up to knowledge-integrating and inference engines?
Strategies for multiple sequence alignment
• Profile pre-processing
• Secondary structure-induced alignment
• Globalised local alignment
• Matrix extension
Objective: try to avoid (early) errors
Globalised local alignment
• Aim: fill each cell of the DP search matrix with the score of the highest-scoring local alignment going through that cell
• Problem: Forward calculation + traceback for each local alignment is too slow
• Solution: Double dynamic programming
1. Local DP in forward and reverse direction (no traceback) + matrix summation
2. Global DP over matrix from step 1 + traceback
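The two steps can be sketched as follows. This is a simplified illustration, not the lecture's implementation: it assumes a match/mismatch score instead of BLOSUM62 and a single linear gap penalty instead of separate opening and extension penalties, and all function names are hypothetical. The score stored for cell (i, j) is the best local alignment that aligns a[i] with b[j]: the best forward score ending just before the cell, plus the cell's own exchange value, plus the best reverse score starting just after it.

```python
def local_matrix(a, b, match=2, mismatch=-1, gap=2):
    """Smith-Waterman forward pass only (no traceback)."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] - gap, H[i][j - 1] - gap)
    return H


def through_scores(a, b, match=2, mismatch=-1, gap=2):
    """Step 1: forward and reverse local DP, then summation per cell."""
    n, m = len(a), len(b)
    fwd = local_matrix(a, b, match, mismatch, gap)
    # The reverse pass is a forward pass over the reversed sequences.
    rev = local_matrix(a[::-1], b[::-1], match, mismatch, gap)
    T = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = match if a[i] == b[j] else mismatch
            T[i][j] = fwd[i][j] + s + rev[n - 1 - i][m - 1 - j]
    return T


def global_over(T):
    """Step 2: global DP over T without exchange matrix or gap penalties."""
    n, m = len(T), len(T[0])
    G = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G[i][j] = max(G[i - 1][j - 1] + T[i - 1][j - 1], G[i - 1][j], G[i][j - 1])
    # Traceback, preferring the diagonal when scores tie.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if G[i][j] == G[i - 1][j - 1] + T[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif G[i][j] == G[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


if __name__ == "__main__":
    T = through_scores("ACGT", "ACGT")
    print(T[0][0], global_over(T))
```

Only one traceback is needed, in the global step, which is what makes the scheme affordable compared with a full local alignment per cell.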
Globalised local alignment
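[Figure: forward local DP matrix + reverse local DP matrix = summed matrix]
1. Local (SW) alignment (uses exchange matrix M and gap penalties Po, Pe)
2. Global (NW) alignment (no M or Po, Pe)
Double dynamic programming
M = BLOSUM62, Po = 0, Pe = 0
M = BLOSUM62, Po = 12, Pe = 1
M = BLOSUM62, Po = 60, Pe = 5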
Integrating alignment methods and alignment information with T-Coffee
• Integrating different pair-wise alignment techniques (NW, SW, ..)
• Combining different multiple alignment methods (consensus multiple alignment)
• Combining sequence alignment methods with structural alignment techniques
• Plug in user knowledge
Matrix extension
T-Coffee: Tree-based Consistency Objective Function For alignmEnt Evaluation
Cedric Notredame, Des Higgins, Jaap Heringa
J. Mol. Biol., 302, 205-217; 2000
Using different sources of alignment information
Clustal
Dialign
Clustal
Lalign
Structure alignments
Manual
T-Coffee
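Progressive multiple alignment
[Figure: pairwise alignment scores (Score 1-2, Score 1-3, ..., Score 4-5) fill a 5×5 similarity matrix; the matrix yields a guide tree, which sets the order in which sequences are joined into the multiple alignment]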
Default T-COFFEE
• Uses information from all sequences for each pair-wise alignment
• Reconciles global and local alignment information
T-Coffee matrix extension
[Figure: pairwise alignments of sequences 1-2, 1-3, 1-4, 2-3, 2-4 and 3-4]
Search matrix extension
T-Coffee
• Combine different alignment techniques by adding scores:
$W(A(x), B(y)) = \sum S(A(x), B(y))$
– A(x) is residue x in sequence A
– summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y))
– S is the sequence identity percentage of the associated alignment
• Combine the direct alignment seqA-seqB with each seqA-seqI-seqB:
$W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min\big(W(A(x), I(z)), W(I(z), B(y))\big)$
– Summation over all third sequences I other than A or B
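The extension formula can be tried out on a small library. The sketch below is an illustration under stated assumptions: residues are represented as (sequence name, position) tuples, the primary library is a plain dict of residue pairs to weights W, and `extend_library` is a hypothetical name, not a T-Coffee function.

```python
from collections import defaultdict


def extend_library(primary):
    """Apply W'(a, b) = W(a, b) + sum over third residues z of min(W(a, z), W(z, b)).

    `primary` maps (residue_a, residue_b) to a weight, where a residue is a
    (sequence name, position) tuple. Returns a dict keyed by frozenset({a, b}).
    """
    # Symmetric view of the library: links[z][r] = W(z, r).
    links = defaultdict(dict)
    for (a, b), w in primary.items():
        links[a][b] = w
        links[b][a] = w

    extended = defaultdict(int)
    # Direct alignment term W(a, b).
    for (a, b), w in primary.items():
        extended[frozenset((a, b))] += w
    # Indirect terms: every residue z that is linked to both a and b.
    for z, partners in links.items():
        names = sorted(partners)
        for idx, a in enumerate(names):
            for b in names[idx + 1:]:
                if a[0] == b[0]:
                    continue  # residues of the same sequence are never paired
                extended[frozenset((a, b))] += min(partners[a], partners[b])
    return dict(extended)


if __name__ == "__main__":
    a1, b1, c1 = ("A", 1), ("B", 1), ("C", 1)
    primary = {(a1, b1): 88, (a1, c1): 77, (c1, b1): 100}
    ext = extend_library(primary)
    print(ext[frozenset((a1, b1))])
```

With the toy weights, the pair A-B gains min(77, 100) = 77 from the path through C, giving 88 + 77 = 165.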
T-Coffee
Direct alignment
Other sequences
T-Coffee library system
Seq1  AA1  Seq2  AA2  Weight
3     V31  5     L33  10
3     V31  6     L34  14
5     L33  6     R35  21
5     L33  6     I36  35
T-Coffee progressive alignment
MDAGSTVILCFVG
MDAASTILCGS
Amino Acid Exchange Matrix
Gap penalties (open, extension)
Search matrix
MDAGSTVILCFVG-
MDAAST-ILC--GS
Kinase nucleotide binding sites
Comparing T-coffee with other methods
but.....
T-COFFEE (V1.23) multiple sequence alignment
Flavodoxin-cheY
1fx1 ----PKALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----
FLAV_DESVH ---MPKALIVYGSTTGNTEYTAETIARELADAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----
FLAV_DESGI ---MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-METTVVNVADVT-APGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPL-YEDLDRAGLKDKK-----
FLAV_DESSA ---MSKSLIVYGSTTGNTETAAEYVAEAFENKE-IDVELKNVTDVS-VADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPL-YDSLENADLKGKK-----
FLAV_DESDE ---MSKVLIVFGSSTGNTESIAQKLEELIAAGG-HEVTLLNAADAS-AENLADGYDAVLFGCSAWGMEDLE------MQDDFLSL-FEEFNRFGLAGRK-----
4fxn ------MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVN-IDELL-NEDILILGCSAMGDEVLE-------ESEFEPF-IEEIS-TKISGKK-----
FLAV_MEGEL -----MVEIVYWSGTGNTEAMANEIEAAVKAAG-ADVESVRFEDTN-VDDVA-SKDVILLGCPAMGSEELE-------DSVVEPF-FTDLA-PKLKGKK-----
FLAV_CLOAB ----MKISILYSSKTGKTERVAKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQ-ESEGIIFGTPTYYAN---------ISWEMKKW-IDESSEFNLEGKL-----
2fcr -----KIGIFFSTSTGNTTEVADFIGKTLGAKA---DAPIDVDDVTDPQAL-KDYDLLFLGAPTWNTGA----DTERSGTSWDEFLYDKLPEVDMKDLP-----
FLAV_ENTAG ---MATIGIFFGSDTGQTRKVAKLIHQKLDGIA---DAPLDVRRAT-REQF-LSYPVLLLGTPTLGDGELPGVEAGSQYDSWQEF-TNTLSEADLTGKT-----
FLAV_ANASP ---SKKIGLFYGTQTGKTESVAEIIRDEFGNDV---VTLHDVSQAE-VTDL-NDYQYLIIGCPTWNIGEL--------QSDWEGL-YSELDDVDFNGKL-----
FLAV_AZOVI ----AKIGLFFGSNTGKTRKVAKSIKKRFDDET-M-SDALNVNRVS-AEDF-AQYQFLILGTPTLGEGELPGLSSDCENESWEEF-LPKIEGLDFSGKT-----
FLAV_ECOLI ----AITGIFFGSDTGNTENIAKMIQKQLGKDV---ADVHDIAKSS-KEDL-EAYDILLLGIPTWYYGEA--------QCDWDDF-FPTLEEIDFNGKL-----
3chy ADKELKFLVVD--DFSTMRRIVRNLLKELGFN-NVE-EAEDGVDALNKLQ-AGGYGFVISDWNMPNMDGLE--------------LLKTIRADGAMSALPVLMV
:. . . : . ::
1fx1 ---------VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG---------------------LRIDGDPRAA--RDDIVGWAHDVRGAI--------
FLAV_DESVH ---------VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG---------------------LRIDGDPRAA--RDDIVGWAHDVRGAI--------
FLAV_DESGI ---------VGVFGCGDSS--YTYFCGA-VDVIEKKAEELGATLVASS---------------------LKIDGEPDSA----EVLDWAREVLARV--------
FLAV_DESSA ---------VSVFGCGDSD--YTYFCGA-VDAIEEKLEKMGAVVIGDS---------------------LKIDGDPE----RDEIVSWGSGIADKI--------
FLAV_DESDE ---------VAAFASGDQE--YEHFCGA-VPAIEERAKELGATIIAEG---------------------LKMEGDASND--PEAVASFAEDVLKQL--------
4fxn ---------VALFGS------YGWGDGKWMRDFEERMNGYGCVVVETP---------------------LIVQNEPD--EAEQDCIEFGKKIANI---------
FLAV_MEGEL ---------VGLFGS------YGWGSGEWMDAWKQRTEDTGATVIGTA---------------------IV--NEMP--DNAPECKELGEAAAKA---------
FLAV_CLOAB ---------GAAFSTANSI--AGGSDIA-LLTILNHLMVKGMLVY----SGGVAFGKPKTHLGYVHINEIQENEDENARIFGERIANKVKQIF-----------
2fcr ---------VAIFGLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRDG-KFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
FLAV_ENTAG ---------VALFGLGDQLNYSKNFVSA-MRILYDLVIARGACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSWLEKLKPAVL-------
FLAV_ANASP ---------VAYFGTGDQIGYADNFQDA-IGILEEKISQRGGKTVGYWSTDGYDFNDSKALRNG-KFVGLALDEDNQSDLTDDRIKSWVAQLKSEFGL------
FLAV_AZOVI ---------VALFGLGDQVGYPENYLDA-LGELYSFFKDRGAKIVGSWSTDGYEFESSEAVVDG-KFVGLALDLDNQSGKTDERVAAWLAQIAPEFGLSL----
FLAV_ECOLI ---------VALFGCGDQEDYAEYFCDA-LGTIRDIIEPRGATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKWVKQISEELHLDEILNA
3chy TAEAKKENIIAAAQAGASGYVVKPFT---AATLEEKLNKIFEKLGM----------------------------------------------------------
.
Evaluating multiple alignments
• Conflicting standards of truth
– evolution
– structure
– function
• With orphan sequences no additional information
• Benchmarks depending on reference alignments
• Quality issue of available reference alignment databases
• Different ways to quantify agreement with reference alignment (sum-of-pairs, column score)
• “Charlie Chaplin” problem
Evaluating multiple alignments
• As a standard of truth, often a reference alignment based on structural superpositioning is taken
Evaluation measures
Query Reference
Column score
Sum-of-Pairs score
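The two reference-based measures can be sketched as below, assuming the commonly used definitions: the sum-of-pairs score is the fraction of residue pairs aligned in the reference that are also aligned in the query, and the column score is the fraction of reference columns reproduced exactly in the query. Alignments are lists of equal-length strings with '-' for gaps, rows in the same sequence order; the function names are hypothetical.

```python
def columns(alignment):
    """Turn an alignment into columns of residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols


def aligned_pairs(cols):
    """All residue pairs that share a column, as (seq, index, seq, index)."""
    pairs = set()
    for col in cols:
        for s in range(len(col)):
            for t in range(s + 1, len(col)):
                if col[s] is not None and col[t] is not None:
                    pairs.add((s, col[s], t, col[t]))
    return pairs


def sp_score(query, reference):
    """Fraction of reference residue pairs recovered by the query."""
    q = aligned_pairs(columns(query))
    r = aligned_pairs(columns(reference))
    return len(q & r) / len(r)


def column_score(query, reference):
    """Fraction of reference columns found unchanged in the query."""
    q = set(columns(query))
    r = columns(reference)
    return sum(col in q for col in r) / len(r)


if __name__ == "__main__":
    reference = ["ACG-T", "ACGTT"]
    query = ["ACGT-", "ACGTT"]
    print(sp_score(query, reference), column_score(query, reference))
```

The column score is the stricter of the two: a single misplaced residue spoils its whole column, while the sum-of-pairs score only loses the pairs involving that residue.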
Scoring a multiple alignment
Query
Sum-of-Pairs score:
• For each alignment position: take the sum over all pairs (add amino acid exchange values)
• As an option, subtract gap penalties
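A minimal sketch of this objective score. It assumes a toy exchange function in place of a real amino acid exchange matrix, and it charges the optional gap penalty once per residue-gap pair in a column, which is only one of several possible conventions; `sum_of_pairs` and `toy_exchange` are hypothetical names.

```python
from itertools import combinations


def toy_exchange(x, y):
    """Stand-in for an exchange matrix lookup: 2 for identity, -1 otherwise."""
    return 2 if x == y else -1


def sum_of_pairs(alignment, exchange=toy_exchange, gap_penalty=0):
    """Sum exchange values over all residue pairs in every column.

    A pair of a residue and a gap costs `gap_penalty`; gap-gap pairs are ignored.
    """
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue
            if x == "-" or y == "-":
                total -= gap_penalty
            else:
                total += exchange(x, y)
    return total


if __name__ == "__main__":
    aln = ["AC-", "ACG", "A-G"]
    print(sum_of_pairs(aln), sum_of_pairs(aln, gap_penalty=1))
```

Unlike the reference-based measures, this score needs no reference alignment, so it can also be applied to orphan sequences.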
Evaluating multiple alignments
[Figure: SP score plotted against BAliBASE alignment size (nseq * len)]
Summary
• Weighting schemes simulating simultaneous multiple alignment
– Profile pre-processing (global/local)
– Matrix extension (well balanced scheme)
• Smoothing alignment signals
– globalised local alignment
• Using additional information
– secondary structure driven alignment
• Schemes strike balance between speed and sensitivity
References
• Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364.
• Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.
• Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem., 26(5), 459-477.
Where to find this: http://www.ibivu.cs.vu.nl/teaching