
Medical Natural Sciences Year 2: Introduction to Bioinformatics
Lecture 9: Multiple sequence alignment (III)
Centre for Integrative Bioinformatics VU

Intermezzo: Symmetry-derived secondary structure prediction using multiple sequence alignments (SymSSP)
Victor Simossis, Jaap Heringa
Centre for Integrative Bioinformatics VU (IBIVU), Vrije Universiteit Amsterdam, The Netherlands

Symmetry-derived secondary structure prediction using multiple sequence alignments (SymSSP)
• Modern state-of-the-art methods use multiple sequence alignments
• Methods like PHD, Profs, SSPro, etc., predict for the top sequence in the alignment by cutting out positions with gaps in the top sequence
• What if two helices ‘out of phase’ are pasted together? Or a strand and a helix?
• Approach: correct by permuting alignments and consensus prediction
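
The permutation step can be sketched in Python as follows. This is a minimal illustration, assuming the alignment is a list of equal-length gapped strings; the function names `project_on_top` and `permuted_views` are hypothetical and not taken from SymSSP. Each view puts one sequence on top and drops the columns where that sequence has a gap, which is the input a top-sequence predictor would see.

```python
def project_on_top(alignment, top):
    """Return the alignment as seen with sequence `top` placed first.

    Columns in which the top sequence has a gap are cut out, as done by
    methods that predict only for the top sequence (assumed behaviour).
    """
    keep = [c for c, ch in enumerate(alignment[top]) if ch != "-"]
    order = [top] + [i for i in range(len(alignment)) if i != top]
    return ["".join(alignment[r][c] for c in keep) for r in order]


def permuted_views(alignment):
    """One projected view per sequence, each sequence taking the top slot in turn."""
    return [project_on_top(alignment, t) for t in range(len(alignment))]


# Toy alignment (hypothetical data)
toy = ["AC-GT",
       "A-TGT"]
views = permuted_views(toy)
```

Running a predictor on every view gives, for each sequence, one prediction per view; these are the predictions that a consensus step would then combine.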

Secondary structure periodicity patterns
Buried β-strand
Edge β-strand
α-helix
hydrophobic hydrophilic

Symmetry-derived Secondary structure prediction using MA (SymSSP)
[Figure: the alignment of sequences 1-4 is permuted so that each sequence is placed on top in turn (orders 1234, 2134, 3124, 4123). Each permutation yields one E/H prediction per sequence, with '?' at some positions; the predictions obtained for each sequence are then combined into one consensus prediction per sequence.]

Optimal segmentation of predicted secondary structures
H score 0 0 0 0 0 …
E score 3 4 4 4 3 …
C score 1 0 0 0 0 …
? score 0 0 0 0 1 …
Region 0 1 1 1 0 …
Predictions for sequence 1 (1->1, 1->2, 1->3, 1->4):
EEEEE HHHHHH EEEEE HH
EEEE? ?HHHHH EEE H
EEEEE HHHHH? ??EE HH
EEEEEE ?HHHHH EEEE HH
Each sequence within an alignment gives rise to a library of n secondary structure predictions, where n is the number of sequences in the alignment.
The predictions are recorded by secondary structure type and region position in a single matrix.
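
Recording a prediction library as per-position counts could look like the sketch below. It assumes each prediction is a string over H, E, C and ? with coil written explicitly as C; the function name `count_matrix` and the four toy predictions are hypothetical, chosen so that the counts match the first five positions shown on the slide.

```python
def count_matrix(predictions, states="HEC?"):
    """Count, per position, how many predictions assign each state."""
    length = len(predictions[0])
    counts = {state: [0] * length for state in states}
    for prediction in predictions:
        for pos, state in enumerate(prediction):
            counts[state][pos] += 1
    return counts


# Hypothetical library of four predictions for one sequence (first 5 positions)
library = ["CEEEE",
           "EEEEE",
           "EEEEE",
           "EEEE?"]
counts = count_matrix(library)
for state in "HEC?":
    print(state, "score", *counts[state])
```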

Optimal segmentation of predicted secondary structures by Dynamic Programming
[Figure: dynamic programming matrix with sequence position on one axis and window size (ws) on the other. Each cell stores a max score, an offset and a label (H, E or C), derived from the H, E, C and ? scores and the region values.]
The recorded values are used in a weighted function according to their secondary structure type, which gives each position a window-specific score. The more probable the secondary structure element, the higher the score.
Restrictions: H only if ws >= 4; E only if ws >= 2
Segmentation score: total score of each path
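
A minimal Python sketch of segmentation by dynamic programming under the window-size restrictions above. The slides do not give the weighted scoring function, so this sketch assumes the score of a segment is simply the sum of the per-position scores for its label; the minimum length of 1 for coil, the name `segment`, and the toy score rows are also assumptions.

```python
MIN_LEN = {"H": 4, "E": 2, "C": 1}  # H and E from the slide; C = 1 is assumed


def segment(scores):
    """Split positions 0..n-1 into labelled segments with maximal total score.

    scores: dict mapping label -> list of per-position scores.
    Returns (label string, total score of the best path).
    """
    n = len(next(iter(scores.values())))
    # Prefix sums give the (assumed additive) score of any window in O(1)
    prefix = {lab: [0] for lab in scores}
    for lab, values in scores.items():
        for v in values:
            prefix[lab].append(prefix[lab][-1] + v)

    neg_inf = float("-inf")
    best = [0.0] + [neg_inf] * n      # best[i]: best score for the first i positions
    back = [None] * (n + 1)           # back[i]: (segment start, label) of last segment
    for end in range(1, n + 1):
        for lab in scores:
            for ws in range(MIN_LEN[lab], end + 1):
                start = end - ws
                if best[start] == neg_inf:
                    continue
                cand = best[start] + prefix[lab][end] - prefix[lab][start]
                if cand > best[end]:
                    best[end], back[end] = cand, (start, lab)

    # Traceback from the last position
    pieces, end = [], n
    while end > 0:
        start, lab = back[end]
        pieces.append(lab * (end - start))
        end = start
    return "".join(reversed(pieces)), best[n]


# Toy score rows (hypothetical)
toy_scores = {"H": [0, 0, 0, 3, 3, 3, 3, 0],
              "E": [3, 3, 3, 0, 0, 0, 0, 0],
              "C": [0, 0, 0, 0, 0, 0, 0, 3]}
labels, total = segment(toy_scores)
```

With the minimum-length restriction, a short run of high H scores cannot be labelled as a helix on its own, which is the effect the restrictions on the slide are meant to have.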

Example of an optimally segmented secondary structure prediction library for sequence 3chy
3chy ---------------GYVV-----KPFTAATLEEKLNKIFEKLGM------
3chy <- 1fx1 ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESDE ??????????????? ee ?? hhhhhhhhhhhhhhh ????????
3chy <- FLAV_DESVH ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESGI ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- FLAV_DESSA ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- 4fxn ??????????????? eee ?? hhhhhhhhhhhhh ?????????
3chy <- FLAV_MEGEL ????????????????eee ?? hh?hhhhhhhhhhh ?????????
3chy <- 2fcr e ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ANASP ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ECOLI eeeeeee hhhhhhhhhhhhhhh hhhhh
3chy <- FLAV_AZOVI ? eeeeeee hhhhhhhhhhhhhhh ????
3chy <- FLAV_ENTAG e eeeeeeee hhhhhhhhhhhhhhhh? ??????
3chy <- FLAV_CLOAB eeeeeee hhhhhhhhhh ???????????
3chy <- 3chy --------------- ----- hhhhhhhhhhhhhh ------
Consensus ---------------EEEE----- HHHHHHHHHHHHH ------
Consensus-DSSP ...............****.....****xx***************......
PHD --------------- ----- HHHHHHHHHHHHHH ------
PHD-DSSP ...............xxxx.....******************x**......
DSSP ...............EEEE.....SS HHHHHHHHHHHHHHHT ......
LumpDSSP ...............EEEE..... HHHHHHHHHHHHHHH ......

Symmetry-derived secondary structure prediction (SymSSP)
• Tried over 120 different consensus weighting schemes (global, regional, positional)
• Tested on ~2700 HOMSTRAD alignments and compared to PHD: on average 0.5% better
• 60% of the alignments are improved, 20% are not affected and 20% are made worse
• Tried to correlate schemes with “cheap” a priori data (pairwise identities, sequence lengths, number of sequences, etc.)

Integrating secondary structure prediction and multiple sequence alignment
• Low key example shown of fairly homogeneous data (strings of letters in both cases)
• But already difficult to do and methods are not easily tunable
• How to scale up to knowledge-integrating and inference engines?

Strategies for multiple sequence alignment
• Profile pre-processing
• Secondary structure-induced alignment
• Globalised local alignment
• Matrix extension
Objective: try to avoid (early) errors

Globalised local alignment
• Aim: fill each cell of the DP search matrix with the score of the highest-scoring local alignment going through that cell
• Problem: Forward calculation + traceback for each local alignment is too slow
• Solution: Double dynamic programming
1. Local DP in forward and reverse direction (no traceback) + matrix summation
2. Global DP over matrix from step 1 + traceback
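
The two steps can be sketched in Python as below. This is an illustration under simplifying assumptions, not the lecture's implementation: a toy match/mismatch function stands in for the exchange matrix M, a single linear gap penalty replaces separate open and extension penalties, and all function names are hypothetical. The forward and reverse local matrices are summed, and the residue-pair score is subtracted once because both passes include it.

```python
def simple_score(x, y):
    """Toy exchange values standing in for a matrix M such as BLOSUM62."""
    return 2 if x == y else -1


def match_ending_scores(a, b, score, gap):
    """Best local alignment score ending with a[i] aligned to b[j] (no traceback)."""
    n, m = len(a), len(b)
    best = [[0] * (m + 1) for _ in range(n + 1)]  # Smith-Waterman matrix
    ending = [[0] * m for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ending[i - 1][j - 1] = best[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            best[i][j] = max(0, ending[i - 1][j - 1],
                             best[i - 1][j] - gap, best[i][j - 1] - gap)
    return ending


def through_scores(a, b, score, gap):
    """Step 1: forward + reverse local DP, summed per cell."""
    n, m = len(a), len(b)
    fwd = match_ending_scores(a, b, score, gap)
    rev = match_ending_scores(a[::-1], b[::-1], score, gap)
    return [[fwd[i][j] + rev[n - 1 - i][m - 1 - j] - score(a[i], b[j])
             for j in range(m)] for i in range(n)]


def global_align(a, b, w):
    """Step 2: global DP over the summed matrix, no M and no gap penalties."""
    n, m = len(a), len(b)
    g = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g[i][j] = max(g[i - 1][j - 1] + w[i - 1][j - 1], g[i - 1][j], g[i][j - 1])
    top, bottom, i, j = [], [], n, m
    while i > 0 and j > 0:  # traceback
        if g[i][j] == g[i - 1][j - 1] + w[i - 1][j - 1]:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif g[i][j] == g[i - 1][j]:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    while i > 0:
        top.append(a[i - 1]); bottom.append("-"); i -= 1
    while j > 0:
        top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


w = through_scores("ACGT", "ACGT", simple_score, gap=2)
aligned = global_align("ACGT", "ACGT", w)
```

For two identical sequences every diagonal cell receives the score of the full-length local alignment, so the global pass follows the diagonal.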

Globalised local alignment
Double dynamic programming
1. Local (SW) alignment (M + Po,e): forward matrix + reverse matrix = summed matrix
2. Global (NW) alignment (no M or Po,e)

M = BLOSUM62, Po = 0, Pe = 0

M = BLOSUM62, Po = 12, Pe = 1

M = BLOSUM62, Po = 60, Pe = 5

Integrating alignment methods and alignment information with T-Coffee
• Integrating different pair-wise alignment techniques (NW, SW, ..)
• Combining different multiple alignment methods (consensus multiple alignment)
• Combining sequence alignment methods with structural alignment techniques
• Plug in user knowledge

Matrix extension
T-Coffee
Tree-based Consistency Objective Function For alignmEnt Evaluation
Cedric Notredame, Des Higgins, Jaap Heringa
J. Mol. Biol., 302, 205-217; 2000

Using different sources of alignment information
[Figure: alignment sources feeding into T-Coffee: structure alignments, Clustal, Lalign, Dialign, manual alignments]

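Progressive multiple alignment
[Figure: pairwise alignment scores (Score 1-2, Score 1-3, ..., Score 4-5) for 5 sequences fill a 5×5 similarity matrix; the scores give a guide tree, and the guide tree gives the multiple alignment.]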

Default T-COFFEE
• Uses information from all sequences for each pair-wise alignment
• Reconciles global and local alignment information

T-Coffee matrix extension
[Figure: the pairwise alignments 1-2, 1-3, 1-4, 2-3, 2-4 and 3-4]

Search matrix extension

T-Coffee
• Combine different alignment techniques by adding scores:
$$W(A(x), B(y)) = \sum S(A(x), B(y))$$
– A(x) is residue x in sequence A
– summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y))
– S is sequence identity percentage of the associated alignment
• Combine direct alignment seqA-seqB with each seqA-seqI-seqB:
$$W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min\bigl(W(A(x), I(z)),\, W(I(z), B(y))\bigr)$$
– Summation over all third sequences I other than A or B
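
The extension formula can be sketched in Python as follows. This is an illustrative sketch, not T-Coffee's own code: the library is assumed to be a dict mapping a pair of residues to its weight W, a residue is a (sequence name, position) tuple, each pair is stored once, and the names `pair_key` and `extend_library` are hypothetical. Every residue I(z) that is linked to both A(x) and B(y) contributes the minimum of its two weights.

```python
from collections import defaultdict
from itertools import combinations


def pair_key(r1, r2):
    """Store each residue pair in one canonical order."""
    return (r1, r2) if r1 <= r2 else (r2, r1)


def extend_library(library):
    """Return W' given a primary library {(residue, residue): W}."""
    neighbours = defaultdict(dict)
    extended = defaultdict(int)
    for (r1, r2), weight in library.items():
        neighbours[r1][r2] = weight
        neighbours[r2][r1] = weight
        extended[pair_key(r1, r2)] += weight          # direct alignment term W
    for mid, linked in neighbours.items():            # mid plays the role of I(z)
        for ra, rb in combinations(sorted(linked), 2):
            if ra[0] != rb[0]:                        # A and B must be different sequences
                extended[pair_key(ra, rb)] += min(linked[ra], linked[rb])
    return dict(extended)


# Toy primary library for three sequences (hypothetical weights)
primary = {(("A", 1), ("B", 1)): 88,
           (("A", 1), ("C", 1)): 77,
           (("B", 1), ("C", 1)): 100}
extended = extend_library(primary)
```

In this sketch a pair that is absent from the primary library can still receive a weight through a third sequence, in which case its direct term is taken as 0.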

T-Coffee
Direct alignment
Other sequences

T-Coffee library system
Seq1  AA1  Seq2  AA2  Weight
3     V31  5     L33  10
3     V31  6     L34  14
5     L33  6     R35  21
5     L33  6     I36  35

T-Coffee progressive alignment
MDAGSTVILCFVG
MDAASTILCGS
Amino Acid Exchange Matrix
Gap penalties (open, extension)
Search matrix
MDAGSTVILCFVG-
MDAAST-ILC--GS

Kinase nucleotide binding sites

Comparing T-coffee with other methods

but.....
T-COFFEE (V1.23) multiple sequence alignment
Flavodoxin-cheY
1fx1       ----PKALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----
FLAV_DESVH ---MPKALIVYGSTTGNTEYTAETIARELADAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----
FLAV_DESGI ---MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-METTVVNVADVT-APGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPL-YEDLDRAGLKDKK-----
FLAV_DESSA ---MSKSLIVYGSTTGNTETAAEYVAEAFENKE-IDVELKNVTDVS-VADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPL-YDSLENADLKGKK-----
FLAV_DESDE ---MSKVLIVFGSSTGNTESIAQKLEELIAAGG-HEVTLLNAADAS-AENLADGYDAVLFGCSAWGMEDLE------MQDDFLSL-FEEFNRFGLAGRK-----
4fxn       ------MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVN-IDELL-NEDILILGCSAMGDEVLE-------ESEFEPF-IEEIS-TKISGKK-----
FLAV_MEGEL -----MVEIVYWSGTGNTEAMANEIEAAVKAAG-ADVESVRFEDTN-VDDVA-SKDVILLGCPAMGSEELE-------DSVVEPF-FTDLA-PKLKGKK-----
FLAV_CLOAB ----MKISILYSSKTGKTERVAKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQ-ESEGIIFGTPTYYAN---------ISWEMKKW-IDESSEFNLEGKL-----
2fcr       -----KIGIFFSTSTGNTTEVADFIGKTLGAKA---DAPIDVDDVTDPQAL-KDYDLLFLGAPTWNTGA----DTERSGTSWDEFLYDKLPEVDMKDLP-----
FLAV_ENTAG ---MATIGIFFGSDTGQTRKVAKLIHQKLDGIA---DAPLDVRRAT-REQF-LSYPVLLLGTPTLGDGELPGVEAGSQYDSWQEF-TNTLSEADLTGKT-----
FLAV_ANASP ---SKKIGLFYGTQTGKTESVAEIIRDEFGNDV---VTLHDVSQAE-VTDL-NDYQYLIIGCPTWNIGEL--------QSDWEGL-YSELDDVDFNGKL-----
FLAV_AZOVI ----AKIGLFFGSNTGKTRKVAKSIKKRFDDET-M-SDALNVNRVS-AEDF-AQYQFLILGTPTLGEGELPGLSSDCENESWEEF-LPKIEGLDFSGKT-----
FLAV_ECOLI ----AITGIFFGSDTGNTENIAKMIQKQLGKDV---ADVHDIAKSS-KEDL-EAYDILLLGIPTWYYGEA--------QCDWDDF-FPTLEEIDFNGKL-----
3chy       ADKELKFLVVD--DFSTMRRIVRNLLKELGFN-NVE-EAEDGVDALNKLQ-AGGYGFVISDWNMPNMDGLE--------------LLKTIRADGAMSALPVLMV
:. . . : . ::

1fx1       ---------VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG---------------------LRIDGDPRAA--RDDIVGWAHDVRGAI--------
FLAV_DESVH ---------VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG---------------------LRIDGDPRAA--RDDIVGWAHDVRGAI--------
FLAV_DESGI ---------VGVFGCGDSS--YTYFCGA-VDVIEKKAEELGATLVASS---------------------LKIDGEPDSA----EVLDWAREVLARV--------
FLAV_DESSA ---------VSVFGCGDSD--YTYFCGA-VDAIEEKLEKMGAVVIGDS---------------------LKIDGDPE----RDEIVSWGSGIADKI--------
FLAV_DESDE ---------VAAFASGDQE--YEHFCGA-VPAIEERAKELGATIIAEG---------------------LKMEGDASND--PEAVASFAEDVLKQL--------
4fxn       ---------VALFGS------YGWGDGKWMRDFEERMNGYGCVVVETP---------------------LIVQNEPD--EAEQDCIEFGKKIANI---------
FLAV_MEGEL ---------VGLFGS------YGWGSGEWMDAWKQRTEDTGATVIGTA---------------------IV--NEMP--DNAPECKELGEAAAKA---------
FLAV_CLOAB ---------GAAFSTANSI--AGGSDIA-LLTILNHLMVKGMLVY----SGGVAFGKPKTHLGYVHINEIQENEDENARIFGERIANKVKQIF-----------
2fcr       ---------VAIFGLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRDG-KFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
FLAV_ENTAG ---------VALFGLGDQLNYSKNFVSA-MRILYDLVIARGACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSWLEKLKPAVL-------
FLAV_ANASP ---------VAYFGTGDQIGYADNFQDA-IGILEEKISQRGGKTVGYWSTDGYDFNDSKALRNG-KFVGLALDEDNQSDLTDDRIKSWVAQLKSEFGL------
FLAV_AZOVI ---------VALFGLGDQVGYPENYLDA-LGELYSFFKDRGAKIVGSWSTDGYEFESSEAVVDG-KFVGLALDLDNQSGKTDERVAAWLAQIAPEFGLSL----
FLAV_ECOLI ---------VALFGCGDQEDYAEYFCDA-LGTIRDIIEPRGATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKWVKQISEELHLDEILNA
3chy       TAEAKKENIIAAAQAGASGYVVKPFT---AATLEEKLNKIFEKLGM----------------------------------------------------------
.

Evaluating multiple alignments
• Conflicting standards of truth
– evolution
– structure
– function
• With orphan sequences no additional information
• Benchmarks depending on reference alignments
• Quality issue of available reference alignment databases
• Different ways to quantify agreement with reference alignment (sum-of-pairs, column score)
• “Charlie Chaplin” problem

Evaluating multiple alignments
• As a standard of truth, often a reference alignment based on structural superpositioning is taken

Evaluation measures
Query alignment compared with Reference alignment
Column score
Sum-of-Pairs score
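
Both measures can be sketched in Python as below, using their usual benchmark definitions as an assumption: the sum-of-pairs score is the fraction of residue pairs aligned in the reference that are also aligned in the query, and the column score is the fraction of reference columns reproduced exactly in the query. Function names are hypothetical, and both alignments are assumed to contain the same sequences in the same row order.

```python
from itertools import combinations


def residue_columns(alignment):
    """For each column, map row index -> index of the residue in that row (gaps skipped)."""
    seen = [0] * len(alignment)
    columns = []
    for c in range(len(alignment[0])):
        column = {}
        for r, row in enumerate(alignment):
            if row[c] != "-":
                column[r] = seen[r]
                seen[r] += 1
        columns.append(column)
    return columns


def aligned_pairs(alignment):
    """Set of all residue pairs that share a column."""
    pairs = set()
    for column in residue_columns(alignment):
        for p, q in combinations(sorted(column.items()), 2):
            pairs.add((p, q))
    return pairs


def sum_of_pairs_score(query, reference):
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(query)) / len(ref_pairs)


def column_score(query, reference):
    query_cols = {frozenset(col.items()) for col in residue_columns(query)}
    ref_cols = [frozenset(col.items()) for col in residue_columns(reference)]
    return sum(col in query_cols for col in ref_cols) / len(ref_cols)


# Toy alignments (hypothetical data)
reference = ["AC-GT",
             "ACTGT"]
query = ["ACG-T",
         "ACTGT"]
sp = sum_of_pairs_score(query, reference)
cs = column_score(query, reference)
```

The column score is the stricter of the two: one misplaced residue spoils a whole column, while the sum-of-pairs score only loses the pairs involving that residue.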

Scoring a multiple alignment
Query
Sum-of-Pairs score:
• For each alignment position: take the sum of all pairs (add a.a. exchange values)
• As an option, subtract gap penalties
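
A minimal Python sketch of this score. The slide does not say how gaps are penalised, so the sketch assumes a fixed penalty for every residue-gap pair and ignores gap-gap pairs; `identity_exchange` is a toy stand-in for an amino acid exchange matrix, and all names are hypothetical.

```python
from itertools import combinations


def identity_exchange(x, y):
    """Toy exchange values: 1 for identical residues, 0 otherwise."""
    return 1 if x == y else 0


def sp_alignment_score(alignment, exchange, gap_penalty=0):
    """Sum exchange values over all residue pairs in every column."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue                      # gap-gap pairs carry no score (assumed)
            if x == "-" or y == "-":
                total -= gap_penalty          # optional penalty per residue-gap pair
            else:
                total += exchange(x, y)
    return total


# Toy alignment (hypothetical data)
toy = ["AC",
       "AC",
       "A-"]
plain = sp_alignment_score(toy, identity_exchange)
penalised = sp_alignment_score(toy, identity_exchange, gap_penalty=1)
```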

Evaluating multiple alignments
[Figure: ∆SP plotted against BAliBASE alignment size (nseq * len)]

Summary
• Weighting schemes simulating simultaneous multiple alignment
– Profile pre-processing (global/local)
– Matrix extension (well balanced scheme)
• Smoothing alignment signals
– globalised local alignment
• Using additional information
– secondary structure driven alignment
• Schemes strike balance between speed and sensitivity

References
• Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364.
• Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.
• Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem., 26(5), 459-477.

Where to find this: http://www.ibivu.cs.vu.nl/teaching