
Medical Natural Sciences Year 2: Introduction to Bioinformatics

Lecture 9: Multiple sequence alignment (III)

Centre for Integrative Bioinformatics VU

Intermezzo: Symmetry-derived secondary structure prediction using multiple sequence alignments (SymSSP)

Victor Simossis, Jaap Heringa

Centre for Integrative Bioinformatics VU (IBIVU), Vrije Universiteit Amsterdam, The Netherlands

Symmetry-derived secondary structure prediction using multiple sequence alignments (SymSSP)

• Modern state-of-the-art methods use multiple sequence alignments

• Methods like PHD, Profs, SSPro, etc., predict for the top sequence in the alignment by cutting out positions with gaps in the top sequence

• What if two helices ‘out of phase’ are pasted together? Or a strand and a helix?

• Approach: correct by permuting alignments and consensus prediction
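
The permutation idea can be illustrated with a minimal Python sketch. It assumes that a prediction made for the ungapped top sequence is carried over to another sequence column by column, and that positions aligned to a gap in the top sequence receive '?'. The function name `project_prediction` and the toy sequences are hypothetical and not taken from SymSSP itself.

```python
def project_prediction(top_row, target_row, top_prediction):
    """Carry a prediction for the ungapped top sequence over to target_row.

    top_row and target_row are two rows of the same alignment ('-' = gap).
    top_prediction has one state per residue of the top sequence.
    Residues of the target aligned to a gap in the top row get '?'.
    """
    projected = []
    top_pos = 0  # index into the ungapped top sequence
    for top_res, target_res in zip(top_row, target_row):
        if top_res != "-":
            state = top_prediction[top_pos]
            top_pos += 1
        else:
            state = "?"  # this column was cut out before prediction
        if target_res != "-":
            projected.append(state)
    return "".join(projected)


# Toy example: the top sequence has a gap where the target has residue A.
print(project_prediction("MK-LV", "MKALV", "HHEE"))  # HH?EE
```

Repeating this with every sequence placed on top in turn would give each sequence one projected prediction per permutation, which is the library that the consensus step works on.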

Secondary structure periodicity patterns

Buried β-strand

Edge β-strand

α-helix

hydrophobic hydrophilic

Symmetry-derived Secondary structure prediction using MA (SymSSP)

[Figure: a four-sequence alignment is permuted so that each sequence in turn is placed on top (orders 1234, 2134, 3124, 4123). Every permutation yields E/H segment predictions for the four sequences, with '?' marking positions that receive no prediction; the predictions are then grouped per sequence.]

Optimal segmentation of predicted secondary structures

[Figure: the predictions collected for one sequence (1->1, 1->2, 1->3, 1->4) are stacked and the C, E, H and ? states are counted at each position.]

H score   0 0 0 0 0 …
E score   3 4 4 4 3 …
C score   1 0 0 0 0 …
? score   0 0 0 0 1 …
Region    0 1 1 1 0 …

Each sequence within an alignment gives rise to a library of n secondary structure predictions, where n is the number of sequences in the alignment.

The predictions are recorded by secondary structure type and region position in a single matrix.
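
As a rough illustration of how such a matrix could be filled, the Python sketch below counts the H, E, C and ? states per position over a library of predictions. The letter 'C' for coil and the four toy predictions are assumptions chosen to reproduce the first five columns of the score rows on this slide; the Region row is not modelled.

```python
STATES = "HEC?"  # helix, strand, coil (assumed symbol), no prediction


def position_scores(library):
    """Count, for every position, how many predictions vote for each state."""
    length = len(library[0])
    if any(len(prediction) != length for prediction in library):
        raise ValueError("all predictions must have the same length")
    scores = {state: [0] * length for state in STATES}
    for prediction in library:
        for pos, state in enumerate(prediction):
            scores[state][pos] += 1
    return scores


# Hypothetical library of n = 4 predictions for one sequence.
library = ["EEEEE", "CEEE?", "EEEEE", "EEEEE"]
for state, row in position_scores(library).items():
    print(state, row)
```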

Optimal segmentation of predicted secondary structures by Dynamic Programming

[Figure: search matrix indexed by sequence position and window size (ws); each cell holds the H, E and C scores together with the max score, offset and label.]

The recorded values are combined in a weighted function, according to their secondary structure type, which gives each position a window-specific score. The more probable the secondary structure element, the higher the score.

Restrictions:

• H only if ws >= 4

• E only if ws >= 2

Segmentation score (total score of each path)
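
The segmentation step can be sketched as a small dynamic program. The sketch below is a simplification under stated assumptions: a segment's score is simply the sum of the per-position counts for its label (the slides use a weighted, window-specific function instead), coil segments may have any length, and the names `segment` and `MIN_LEN` are hypothetical. Only the length restrictions (H at least 4, E at least 2) come from the slide.

```python
MIN_LEN = {"H": 4, "E": 2, "C": 1}  # H and E limits from the slide; C is assumed


def segment(scores):
    """Return (best total score, label string) for an optimal segmentation.

    scores maps each label in MIN_LEN to a list of per-position scores.
    """
    n = len(scores["C"])
    # Prefix sums allow O(1) scoring of any segment.
    prefix = {}
    for label in MIN_LEN:
        sums = [0]
        for value in scores[label]:
            sums.append(sums[-1] + value)
        prefix[label] = sums

    best = [float("-inf")] * (n + 1)  # best[i]: best score for positions < i
    best[0] = 0
    back = [None] * (n + 1)           # back[i]: (segment start, label)
    for end in range(1, n + 1):
        for label, min_len in MIN_LEN.items():
            for start in range(0, end - min_len + 1):
                cand = best[start] + prefix[label][end] - prefix[label][start]
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, label)

    # Trace back the chosen segments from the end of the sequence.
    pieces = []
    pos = n
    while pos > 0:
        start, label = back[pos]
        pieces.append(label * (pos - start))
        pos = start
    return best[n], "".join(reversed(pieces))


toy = {
    "H": [0, 0, 0, 3, 3, 3, 3, 0],
    "E": [3, 4, 4, 0, 0, 0, 0, 0],
    "C": [1, 0, 0, 1, 1, 1, 1, 4],
}
print(segment(toy))  # (27, 'EEEHHHHC')
```

With only three positions a helix is never allowed, so a strongly helical but too short stretch falls back to coil in this sketch.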

Example of an optimally segmented secondary structure prediction library for sequence 3chy

3chy                ---------------GYVV-----KPFTAATLEEKLNKIFEKLGM------
3chy <- 1fx1        ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESDE  ??????????????? ee ?? hhhhhhhhhhhhhhh ????????
3chy <- FLAV_DESVH  ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESGI  ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- FLAV_DESSA  ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- 4fxn        ??????????????? eee ?? hhhhhhhhhhhhh ?????????
3chy <- FLAV_MEGEL  ????????????????eee ?? hh?hhhhhhhhhhh ?????????
3chy <- 2fcr        e ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ANASP  ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ECOLI  eeeeeee hhhhhhhhhhhhhhh hhhhh
3chy <- FLAV_AZOVI  ? eeeeeee hhhhhhhhhhhhhhh ????
3chy <- FLAV_ENTAG  e eeeeeeee hhhhhhhhhhhhhhhh? ??????
3chy <- FLAV_CLOAB  eeeeeee hhhhhhhhhh ???????????
3chy <- 3chy        --------------- ----- hhhhhhhhhhhhhh ------

Consensus           ---------------EEEE----- HHHHHHHHHHHHH ------
Consensus-DSSP      ...............****.....****xx***************......

PHD                 --------------- ----- HHHHHHHHHHHHHH ------
PHD-DSSP            ...............xxxx.....******************x**......

DSSP                ...............EEEE.....SS HHHHHHHHHHHHHHHT ......
LumpDSSP            ...............EEEE..... HHHHHHHHHHHHHHH ......

Symmetry-derived secondary structure prediction (SymSSP)

• Tried over 120 different consensus weighting schemes (global, regional, positional)

• Tested on ~2700 HOMSTRAD alignments and compared to PHD: on average 0.5% better

• 60% of the alignments are improved, 20% are not affected and 20% are made worse

• Tried to correlate schemes with “cheap” a priori data (pairwise identities, sequence lengths, number of sequences, etc.)

Integrating secondary structure prediction and multiple sequence alignment

• Low key example shown of fairly homogeneous data (strings of letters in both cases)

• But already difficult to do and methods are not easily tunable

• How to scale up to knowledge-integrating and inference engines?

Strategies for multiple sequence alignment

• Profile pre-processing

• Secondary structure-induced alignment

• Globalised local alignment

• Matrix extension

Objective: try to avoid (early) errors

Globalised local alignment

• Aim: fill each DP search matrix with the highest possible local alignment going through that cell

• Problem: Forward calculation + traceback for each local alignment is too slow

• Solution: Double dynamic programming

1. Local DP in forward and reverse direction (no traceback) + matrix summation

2. Global DP over matrix from step 1 + traceback
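
The two steps can be sketched in Python as follows. This is a minimal illustration under assumptions, not the lecture's implementation: it uses a toy match/mismatch score in place of an exchange matrix such as BLOSUM62, a linear gap penalty in the local step, and hypothetical function names. The forward and reverse matrices both include the score of the cell itself, so that score is subtracted once after summation.

```python
def match_score(x, y):
    # Toy exchange values (assumption); the slides use BLOSUM62.
    return 2 if x == y else -1


def ending_scores(a, b, gap=2):
    """F[i][j]: best local alignment score that ends with a[i] paired to b[j]."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # Smith-Waterman matrix
    F = [[0] * m for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i - 1][j - 1] = H[i - 1][j - 1] + match_score(a[i - 1], b[j - 1])
            H[i][j] = max(0, F[i - 1][j - 1], H[i - 1][j] - gap, H[i][j - 1] - gap)
    return F


def through_scores(a, b, gap=2):
    """Step 1: forward + reverse local DP, summed per cell (no traceback)."""
    n, m = len(a), len(b)
    fwd = ending_scores(a, b, gap)
    rev = ending_scores(a[::-1], b[::-1], gap)
    return [[fwd[i][j] + rev[n - 1 - i][m - 1 - j] - match_score(a[i], b[j])
             for j in range(m)] for i in range(n)]


def global_align(a, b, T):
    """Step 2: global DP over matrix T without exchange matrix or gap penalties."""
    n, m = len(a), len(b)
    G = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G[i][j] = max(G[i - 1][j - 1] + T[i - 1][j - 1], G[i - 1][j], G[i][j - 1])
    top, bottom = [], []
    i, j = n, m
    while i > 0 and j > 0:  # traceback
        if G[i][j] == G[i - 1][j - 1] + T[i - 1][j - 1]:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif G[i][j] == G[i - 1][j]:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    while i > 0:
        top.append(a[i - 1]); bottom.append("-"); i -= 1
    while j > 0:
        top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


seq_a, seq_b = "MDAGSTVILCFVG", "MDAASTILCGS"
for row in global_align(seq_a, seq_b, through_scores(seq_a, seq_b)):
    print(row)
```

Because every cell of the summed matrix already carries the score of the best local alignment passing through it, the global step can run without its own exchange matrix or gap penalties.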

Globalised local alignment

1. Local (SW) alignment (M + Po,e)

[Figure: forward local matrix + reverse local matrix = summed matrix]

2. Global (NW) alignment (no M or Po,e)

Double dynamic programming

M = BLOSUM62, Po = 0, Pe = 0


Integrating alignment methods and alignment information with T-Coffee

• Integrating different pair-wise alignment techniques (NW, SW, ..)

• Combining different multiple alignment methods (consensus multiple alignment)

• Combining sequence alignment methods with structural alignment techniques

• Plug in user knowledge

Matrix extension

T-Coffee: Tree-based Consistency Objective Function For alignmEnt Evaluation

Cedric Notredame, Des Higgins, Jaap Heringa. J. Mol. Biol., 302, 205-217; 2000

Using different sources of alignment information

[Figure: alignment information from Clustal, Lalign, Dialign, structure alignments and manual alignments feeds into T-Coffee.]

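Progressive multiple alignment

[Figure: pairwise alignment scores (Score 1-2, Score 1-3, ..., Score 4-5) for 5 sequences fill a 5×5 similarity matrix; the matrix gives a guide tree, and the guide tree directs the multiple alignment.]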

Default T-COFFEE

• Uses information from all sequences for each pair-wise alignment

• Reconciles global and local alignment information

T-Coffee matrix extension

[Figure: the pairwise alignments 1-2, 1-3, 1-4, 2-3, 2-4 and 3-4.]

Search matrix extension

T-Coffee

• Combine different alignment techniques by adding scores:

W(A(x), B(y)) = ∑ S(A(x), B(y))

– A(x) is residue x in sequence A
– summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y))
– S is sequence identity percentage of the associated alignment

• Combine direct alignment seqA-seqB with each seqA-seqI-seqB:

W’(A(x), B(y)) = W(A(x), B(y)) + ∑_{I≠A,B} Min(W(A(x), I(z)), W(I(z), B(y)))

– Summation over all third sequences I other than A or B
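
A small Python sketch of the extension formula follows. It assumes the primary library is a dictionary from residue pairs to weights W, with a residue written as a (sequence name, position) tuple, and that the sum runs over every residue I(z) of a third sequence that is linked to both A(x) and B(y). The data layout, function names and weights are hypothetical.

```python
from collections import defaultdict


def build_neighbours(library):
    """Turn {(residue_a, residue_b): W} into a symmetric lookup table."""
    nbrs = defaultdict(dict)
    for (a, b), weight in library.items():
        nbrs[a][b] = weight
        nbrs[b][a] = weight
    return nbrs


def extended_weight(nbrs, a, b):
    """W'(a, b) = W(a, b) + sum over third-sequence residues z of min(W(a, z), W(z, b))."""
    direct = nbrs[a].get(b, 0)
    via_third = 0
    for z, w_az in nbrs[a].items():
        if z[0] in (a[0], b[0]):
            continue  # z must belong to a sequence other than A or B
        w_zb = nbrs[z].get(b)
        if w_zb is not None:
            via_third += min(w_az, w_zb)
    return direct + via_third


# Hypothetical weights: residues are (sequence, position) tuples.
library = {
    (("A", 1), ("B", 1)): 88,
    (("A", 1), ("C", 1)): 77,
    (("C", 1), ("B", 1)): 100,
}
nbrs = build_neighbours(library)
print(extended_weight(nbrs, ("A", 1), ("B", 1)))  # 88 + min(77, 100) = 165
```

A residue pair that never occurs in a direct alignment can still obtain a positive extended weight when a third sequence supports it, which is how information from all sequences enters each pair-wise alignment.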

T-Coffee

Direct alignment

Other sequences

T-Coffee library system

Seq1  AA1  Seq2  AA2  Weight
3     V31  5     L33  10
3     V31  6     L34  14
5     L33  6     R35  21
5     L33  6     I36  35

T-Coffee progressive alignment

MDAGSTVILCFVG
MDAASTILCGS

Amino Acid Exchange Matrix

Gap penalties (open, extension)

Search matrix

MDAGSTVILCFVG-
MDAAST-ILC--GS

Kinase nucleotide binding sites

Comparing T-Coffee with other methods

but.....

T-COFFEE (V1.23) multiple sequence alignment
Flavodoxin-cheY

1fx1        ----PKALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----
FLAV_DESVH  ---MPKALIVYGSTTGNTEYTAETIARELADAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----
FLAV_DESGI  ---MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-METTVVNVADVT-APGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPL-YEDLDRAGLKDKK-----
FLAV_DESSA  ---MSKSLIVYGSTTGNTETAAEYVAEAFENKE-IDVELKNVTDVS-VADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPL-YDSLENADLKGKK-----
FLAV_DESDE  ---MSKVLIVFGSSTGNTESIAQKLEELIAAGG-HEVTLLNAADAS-AENLADGYDAVLFGCSAWGMEDLE------MQDDFLSL-FEEFNRFGLAGRK-----
4fxn        ------MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVN-IDELL-NEDILILGCSAMGDEVLE-------ESEFEPF-IEEIS-TKISGKK-----
FLAV_MEGEL  -----MVEIVYWSGTGNTEAMANEIEAAVKAAG-ADVESVRFEDTN-VDDVA-SKDVILLGCPAMGSEELE-------DSVVEPF-FTDLA-PKLKGKK-----
FLAV_CLOAB  ----MKISILYSSKTGKTERVAKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQ-ESEGIIFGTPTYYAN---------ISWEMKKW-IDESSEFNLEGKL-----
2fcr        -----KIGIFFSTSTGNTTEVADFIGKTLGAKA---DAPIDVDDVTDPQAL-KDYDLLFLGAPTWNTGA----DTERSGTSWDEFLYDKLPEVDMKDLP-----
FLAV_ENTAG  ---MATIGIFFGSDTGQTRKVAKLIHQKLDGIA---DAPLDVRRAT-REQF-LSYPVLLLGTPTLGDGELPGVEAGSQYDSWQEF-TNTLSEADLTGKT-----
FLAV_ANASP  ---SKKIGLFYGTQTGKTESVAEIIRDEFGNDV---VTLHDVSQAE-VTDL-NDYQYLIIGCPTWNIGEL--------QSDWEGL-YSELDDVDFNGKL-----
FLAV_AZOVI  ----AKIGLFFGSNTGKTRKVAKSIKKRFDDET-M-SDALNVNRVS-AEDF-AQYQFLILGTPTLGEGELPGLSSDCENESWEEF-LPKIEGLDFSGKT-----
FLAV_ECOLI  ----AITGIFFGSDTGNTENIAKMIQKQLGKDV---ADVHDIAKSS-KEDL-EAYDILLLGIPTWYYGEA--------QCDWDDF-FPTLEEIDFNGKL-----
3chy        ADKELKFLVVD--DFSTMRRIVRNLLKELGFN-NVE-EAEDGVDALNKLQ-AGGYGFVISDWNMPNMDGLE--------------LLKTIRADGAMSALPVLMV

:. . . : . ::

1fx1        ---------VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG---------------------LRIDGDPRAA--RDDIVGWAHDVRGAI--------
FLAV_DESVH  ---------VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG---------------------LRIDGDPRAA--RDDIVGWAHDVRGAI--------
FLAV_DESGI  ---------VGVFGCGDSS--YTYFCGA-VDVIEKKAEELGATLVASS---------------------LKIDGEPDSA----EVLDWAREVLARV--------
FLAV_DESSA  ---------VSVFGCGDSD--YTYFCGA-VDAIEEKLEKMGAVVIGDS---------------------LKIDGDPE----RDEIVSWGSGIADKI--------
FLAV_DESDE  ---------VAAFASGDQE--YEHFCGA-VPAIEERAKELGATIIAEG---------------------LKMEGDASND--PEAVASFAEDVLKQL--------
4fxn        ---------VALFGS------YGWGDGKWMRDFEERMNGYGCVVVETP---------------------LIVQNEPD--EAEQDCIEFGKKIANI---------
FLAV_MEGEL  ---------VGLFGS------YGWGSGEWMDAWKQRTEDTGATVIGTA---------------------IV--NEMP--DNAPECKELGEAAAKA---------
FLAV_CLOAB  ---------GAAFSTANSI--AGGSDIA-LLTILNHLMVKGMLVY----SGGVAFGKPKTHLGYVHINEIQENEDENARIFGERIANKVKQIF-----------
2fcr        ---------VAIFGLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRDG-KFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
FLAV_ENTAG  ---------VALFGLGDQLNYSKNFVSA-MRILYDLVIARGACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSWLEKLKPAVL-------
FLAV_ANASP  ---------VAYFGTGDQIGYADNFQDA-IGILEEKISQRGGKTVGYWSTDGYDFNDSKALRNG-KFVGLALDEDNQSDLTDDRIKSWVAQLKSEFGL------
FLAV_AZOVI  ---------VALFGLGDQVGYPENYLDA-LGELYSFFKDRGAKIVGSWSTDGYEFESSEAVVDG-KFVGLALDLDNQSGKTDERVAAWLAQIAPEFGLSL----
FLAV_ECOLI  ---------VALFGCGDQEDYAEYFCDA-LGTIRDIIEPRGATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKWVKQISEELHLDEILNA
3chy        TAEAKKENIIAAAQAGASGYVVKPFT---AATLEEKLNKIFEKLGM----------------------------------------------------------

.

Evaluating multiple alignments

• Conflicting standards of truth
– evolution
– structure
– function

• With orphan sequences no additional information

• Benchmarks depending on reference alignments

• Quality issue of available reference alignment databases

• Different ways to quantify agreement with reference alignment (sum-of-pairs, column score)

• “Charlie Chaplin” problem

Evaluating multiple alignments

• As a standard of truth, often a reference alignment based on structural superpositioning is taken

Evaluation measures

[Figure: query alignment compared with reference alignment]

Column score

Sum-of-Pairs score
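
Both agreement measures can be sketched in Python. The definitions used here are the usual benchmark ones and are stated as assumptions, since the slide only shows a figure: the sum-of-pairs score is the fraction of residue pairs aligned in the reference that are also aligned in the query, and the column score is the fraction of reference columns reproduced exactly in the query. Function names and the toy alignments are hypothetical.

```python
def residue_columns(alignment):
    """Describe each column by the residue index of every sequence (None = gap)."""
    counters = [0] * len(alignment)
    columns = []
    for column in zip(*alignment):
        entry = []
        for seq, char in enumerate(column):
            if char == "-":
                entry.append(None)
            else:
                entry.append(counters[seq])
                counters[seq] += 1
        columns.append(tuple(entry))
    return columns


def aligned_pairs(alignment):
    """Set of all residue pairs that share a column."""
    pairs = set()
    for entry in residue_columns(alignment):
        for s in range(len(entry)):
            for t in range(s + 1, len(entry)):
                if entry[s] is not None and entry[t] is not None:
                    pairs.add((s, entry[s], t, entry[t]))
    return pairs


def sp_agreement(query, reference):
    """Fraction of reference residue pairs that the query also aligns."""
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(query)) / len(ref_pairs)


def column_score(query, reference):
    """Fraction of reference columns that occur identically in the query."""
    query_columns = set(residue_columns(query))
    ref_columns = residue_columns(reference)
    return sum(col in query_columns for col in ref_columns) / len(ref_columns)


reference = ["ACGT", "A-GT"]
query = ["ACGT", "AG-T"]
print(sp_agreement(query, reference), column_score(query, reference))
```

The column score is the stricter of the two: a single misplaced residue spoils its whole column, while the sum-of-pairs score still credits the pairs in that column that remain correctly aligned.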

Scoring a multiple alignment

Query

Sum-of-Pairs score:

• For each alignment position: take the sum of all pairs (add a.a. exchange values)

• As an option, subtract gap penalties
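
A minimal Python sketch of this score is given below. The toy exchange function and the choice to charge the optional gap penalty once per residue-gap pair are assumptions for illustration; a real application would use an amino acid exchange matrix and its own gap model.

```python
from itertools import combinations


def toy_exchange(x, y):
    # Hypothetical exchange values standing in for a real matrix.
    return 2 if x == y else -1


def sum_of_pairs(alignment, exchange=toy_exchange, gap_penalty=0):
    """Sum the exchange values of all residue pairs in every column.

    Pairs of two gaps are ignored; each residue-gap pair optionally
    costs gap_penalty (an assumed way of subtracting gap penalties).
    """
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue
            if x == "-" or y == "-":
                total -= gap_penalty
            else:
                total += exchange(x, y)
    return total


alignment = ["AC-", "ACG", "A-G"]
print(sum_of_pairs(alignment))                 # 10
print(sum_of_pairs(alignment, gap_penalty=1))  # 6
```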

Evaluating multiple alignments

[Figure: plot with axis labels ∆SP and BAliBASE alignment nseq * len]

Summary

• Weighting schemes simulating simultaneous multiple alignment
– Profile pre-processing (global/local)
– Matrix extension (well balanced scheme)

• Smoothing alignment signals
– globalised local alignment

• Using additional information
– secondary structure driven alignment

• Schemes strike balance between speed and sensitivity

References

• Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364.

• Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.

• Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem., 26(5), 459-477.

Where to find this: http://www.ibivu.cs.vu.nl/teaching
