Tools for multiple sequence alignment

Methods course: Multiple sequence alignment and Reconstruction of phylogenetic trees. Burkhard Morgenstern, Fabian Schreiber. Göttingen, October/November 2007


Page 1: Tools for multiple sequence alignment

Methods course

Multiple sequence alignment and Reconstruction of phylogenetic trees

Burkhard Morgenstern, Fabian Schreiber

Göttingen, October/November 2007

Page 2: Tools for multiple sequence alignment

Tools for multiple sequence alignment

Multiple alignment: basis of (almost) all methods for sequence analysis in bioinformatics

Page 3: Tools for multiple sequence alignment

Tools for multiple sequence alignment

T Y I M R E A Q Y E

T C I V M R E A Y E

Page 4: Tools for multiple sequence alignment

Tools for multiple sequence alignment

T Y I - M R E A Q Y E

T C I V M R E A - Y E

Page 5: Tools for multiple sequence alignment

Tools for multiple sequence alignment

T Y I M R E A Q Y E

T C I V M R E A Y E

Y I M Q E V Q Q E

Y I A M R E Q Y E

Page 6: Tools for multiple sequence alignment

Tools for multiple sequence alignment

T Y I - M R E A Q Y E

T C I V M R E A - Y E

Y - I - M Q E V Q Q E

Y - I A M R E - Q Y E

Page 7: Tools for multiple sequence alignment

Tools for multiple sequence alignment

T Y I - M R E A Q Y E

T C I V M R E A - Y E

- Y I - M Q E V Q Q E

Y - I A M R E - Q Y E

Astronomical Number of possible alignments!
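
To give a feeling for "astronomical", here is a minimal sketch that counts the possible alignments of just two sequences. It assumes the usual convention that every column is (residue, residue), (residue, gap) or (gap, residue), and that alignments differing only in the order of adjacent gap columns are counted separately. The function name `count_alignments` is illustrative, not taken from the course.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Number of ways to align two sequences of lengths m and n."""
    if m == 0 or n == 0:
        # Only one way left: pair every remaining residue with a gap.
        return 1
    # The last column is (residue, residue), (residue, gap) or (gap, residue).
    return (count_alignments(m - 1, n - 1)
            + count_alignments(m - 1, n)
            + count_alignments(m, n - 1))

for length in (2, 5, 10, 50):
    print(length, count_alignments(length, length))
```

Already for two sequences of length 10 the count is in the millions; with more sequences the number grows much faster still.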

Page 8: Tools for multiple sequence alignment

Tools for multiple sequence alignment

T Y I - M R E A Q Y E

T C I V - M R E A Y E

- Y I - M Q E V Q Q E

Y - I A M R E - Q Y E

Astronomical Number of possible alignments!

Page 9: Tools for multiple sequence alignment

Tools for multiple sequence alignment

T Y I - M R E A Q Y E

T C I V M R E A - Y E

- Y I - M Q E V Q Q E

Y - I A M R E - Q Y E

Which one is the best ???

Page 10: Tools for multiple sequence alignment

Tools for multiple sequence alignment

Questions in development of alignment programs:

(1) What is a good alignment?

→ objective function (“score”)

(2) How to find a good alignment?

→ optimization algorithm

Page 11: Tools for multiple sequence alignment

Tools for multiple sequence alignment

What is a biologically good alignment ??

Page 12: Tools for multiple sequence alignment

Tools for multiple sequence alignment

Criteria for alignment quality:

1. 3D-Structure: align residues at corresponding positions in 3D structure of protein!

2. Evolution: align residues with common ancestors!

Page 13: Tools for multiple sequence alignment

Tools for multiple sequence alignment

T Y I - M R E A Q Y E

T C I V M - R E A Y E

- Y I - M Q E V Q Q E

- Y I A M R E - Q Y E

Alignment = hypothesis about sequence evolution

Search for most plausible hypothesis!

Page 14: Tools for multiple sequence alignment

Tools for multiple sequence alignment

T Y I - M R E A Q Y E

T C I V - M R E A Y E

- Y I - M Q E V Q Q E

- Y I A M R E - Q Y E

Alignment = hypothesis about sequence evolution

Search for most plausible hypothesis!

Page 15: Tools for multiple sequence alignment

Tools for multiple sequence alignment

Compute for amino acids a and b

Probability p_{a,b} of substitution a → b (or b → a),

Frequency q_a of a

Define similarity score s(a,b) based on p_{a,b}, q_a

Result: similarity matrix (substitution matrix), e.g. PAM (Dayhoff matrix), BLOSUM, …
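
The slide does not give the formula for s(a,b). A common choice, and the one sketched here as an assumption, is a log-odds score: the logarithm of the observed pair probability divided by the probability expected by chance. The frequencies below are invented toy numbers, not values from PAM or BLOSUM.

```python
import math

def log_odds_score(p_ab: float, q_a: float, q_b: float) -> float:
    """log2 of observed pair probability vs. the product of background frequencies."""
    return math.log2(p_ab / (q_a * q_b))

# Toy numbers, invented for illustration only.
q = {"L": 0.10, "I": 0.06, "W": 0.01}
p = {("L", "I"): 0.012, ("L", "W"): 0.0005}

for (a, b), p_ab in p.items():
    # Positive: pair seen more often than chance; negative: less often.
    print(a, b, round(log_odds_score(p_ab, q[a], q[b]), 2))
```

Pairs that are substituted for each other more often than chance get a positive score, rare substitutions a negative one.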

Page 16: Tools for multiple sequence alignment
Page 17: Tools for multiple sequence alignment

Tools for multiple sequence alignment

Page 18: Tools for multiple sequence alignment

Tools for multiple sequence alignment

Traditional objective functions:

Define score of an alignment as

Sum of individual similarity scores s(a,b) of aligned amino acid residues

minus gap penalty g for each gap in alignment

Optimal alignment can be calculated for two sequences but in practice not for > 8 sequences
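
For two sequences the optimal score can be found by dynamic programming. The sketch below uses the standard Needleman-Wunsch recurrence, which the slide does not name; it assumes a linear gap penalty g per gap position and a hypothetical identity score function.

```python
def global_alignment_score(x, y, s, g):
    """Optimal global alignment score of x and y, gap penalty g per gap position."""
    # prev[j] = best score for aligning the processed prefix of x with y[:j]
    prev = [-g * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-g * i]
        for j in range(1, len(y) + 1):
            cur.append(max(prev[j - 1] + s(x[i - 1], y[j - 1]),  # align two residues
                           prev[j] - g,                           # residue of x against gap
                           cur[j - 1] - g))                       # residue of y against gap
        prev = cur
    return prev[-1]

def identity(a, b):
    # Hypothetical toy scoring: +1 for identical residues, -1 otherwise.
    return 1 if a == b else -1

print(global_alignment_score("TYIMREAQYE", "TCIVMREAYE", identity, g=2))
```

The table has one cell per pair of prefixes, so time grows with the product of the sequence lengths; for N sequences the same idea needs an N-dimensional table, which is why it is impractical beyond a handful of sequences.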

Page 19: Tools for multiple sequence alignment

Example:

T Y W I V

T - - L V

Score = s(T,T) + s(I,L) + s(V,V) - 2g
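
The example score can be reproduced with a few lines of code. This is a minimal sketch; the scores in `toy_s` and the gap penalty are hypothetical values chosen only to make the arithmetic visible.

```python
def alignment_score(row_x, row_y, s, g):
    """Sum of s(a,b) over aligned residue pairs, minus g for every gap position."""
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total -= g
        else:
            total += s(a, b)
    return total

def toy_s(a, b):
    # Hypothetical scores: identical = 4, I/L = 2, anything else = -1.
    if a == b:
        return 4
    if {a, b} == {"I", "L"}:
        return 2
    return -1

# s(T,T) + s(I,L) + s(V,V) - 2g = 4 + 2 + 4 - 2*3 = 4
print(alignment_score("TYWIV", "T--LV", toy_s, g=3))
```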

Page 20: Tools for multiple sequence alignment

Tools for multiple sequence alignment

Most commonly used heuristic for multiple alignment:

Progressive alignment (mid 1980s):

Idea: calculate multiple alignment as series of pairwise alignments of sequences and profiles

Use guide tree to determine order of pairwise alignments
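
How a guide tree fixes the order of the pairwise alignments can be sketched as follows. The clustering rule used here (average linkage, UPGMA-style) is an assumption for illustration; real programs may build the guide tree differently. The distances are invented.

```python
from itertools import combinations

def guide_tree_order(names, dist):
    """Return the sequence of merges of an average-linkage guide tree.

    dist maps frozenset({name_a, name_b}) to a pairwise distance.
    Each merge is a pair of clusters (tuples of names) to be aligned next.
    """
    clusters = [(name,) for name in names]
    merges = []

    def average_distance(c1, c2):
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Pick the two closest clusters; they are aligned to each other next.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average_distance(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2))
    return merges

toy = {("A", "B"): 0.1, ("C", "D"): 0.2, ("A", "C"): 0.6,
       ("A", "D"): 0.6, ("B", "C"): 0.6, ("B", "D"): 0.6}
dist = {frozenset(pair): value for pair, value in toy.items()}

for left, right in guide_tree_order(["A", "B", "C", "D"], dist):
    print("align", left, "with", right)
```

The first merges are sequence-to-sequence alignments; later merges align profiles (already aligned groups) to each other.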

Page 21: Tools for multiple sequence alignment

`Progressive´ Alignment

WCEAQTKNGQGWVPSNYITPVN

WWRLNDKEGYVPRNLLGLYP

AVVIQDNSDIKVVPKAKIIRD

YAVESEAHPGSFQPVAALERIN

WLNYNETTGERGDFPGTYVEYIGRKKISP

Page 22: Tools for multiple sequence alignment

`Progressive´ Alignment

WCEAQTKNGQGWVPSNYITPVN

WWRLNDKEGYVPRNLLGLYP

AVVIQDNSDIKVVPKAKIIRD

YAVESEAHPGSFQPVAALERIN

WLNYNETTGERGDFPGTYVEYIGRKKISP

Guide tree

Page 23: Tools for multiple sequence alignment

`Progressive´ Alignment

WCEAQTKNGQGWVPSNYITPVN

WW--RLNDKEGYVPRNLLGLYP-

AVVIQDNSDIKVVP--KAKIIRD

YAVESEASFQPVAALERIN

WLNYNEERGDFPGTYVEYIGRKKISP

Profile alignment, “once a gap - always a gap”

Page 24: Tools for multiple sequence alignment

`Progressive´ Alignment

WCEAQTKNGQGWVPSNYITPVN

WW--RLNDKEGYVPRNLLGLYP-

AVVIQDNSDIKVVP--KAKIIRD

YAVESEASVQ--PVAALERIN------

WLN-YNEERGDFPGTYVEYIGRKKISP

Profile alignment, “once a gap - always a gap”

Page 25: Tools for multiple sequence alignment

`Progressive´ Alignment

WCEAQTKNGQGWVPSNYITPVN-

WW--RLNDKEGYVPRNLLGLYP-

AVVIQDNSDIKVVP--KAKIIRD

YAVESEASVQ--PVAALERIN------

WLN-YNEERGDFPGTYVEYIGRKKISP

Profile alignment, “once a gap - always a gap”

Page 26: Tools for multiple sequence alignment

`Progressive´ Alignment

WCEAQTKNGQGWVPSNYITPVN--------

WW--RLNDKEGYVPRNLLGLYP--------

AVVIQDNSDIKVVP--KAKIIRD-------

YAVESEA---SVQ--PVAALERIN------

WLN-YNE---ERGDFPGTYVEYIGRKKISP

Profile alignment, “once a gap - always a gap”
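
“Once a gap - always a gap” means that a gap introduced while aligning two profiles is inserted at the same column in every sequence of the profile and is never removed later. A minimal sketch, using the two-sequence profile from the slides; the helper name `insert_gap_column` is illustrative.

```python
def insert_gap_column(profile, position, length=1):
    """Insert a gap of the given length at the same column in every row of a profile."""
    return [row[:position] + "-" * length + row[position:] for row in profile]

profile = ["YAVESEASVQ--PVAALERIN------",
           "WLN-YNEERGDFPGTYVEYIGRKKISP"]

# Existing gaps stay where they are; the new gap is shared by all rows.
for row in insert_gap_column(profile, 7, 3):
    print(row)
```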

Page 27: Tools for multiple sequence alignment

CLUSTAL W

Most important software program: CLUSTAL W:

J. Thompson, T. Gibson, D. Higgins (1994, Nuc. Acids Res.)

(22,327 citations in the literature!, Oct 2007)

Page 28: Tools for multiple sequence alignment

Tools for multiple sequence alignment

Problems with traditional approach:

Results depend on gap penalty

Heuristic guide tree determines alignment;

alignment used for phylogeny reconstruction

Algorithm produces global alignments.

Page 29: Tools for multiple sequence alignment

Tools for multiple sequence alignment

Problems with traditional approach:

But:

Many sequence families share only local similarity

E.g. sequences share one conserved motif

Page 30: Tools for multiple sequence alignment

Local sequence alignment

Find common motif in sequences; ignore the rest

EYENS

ERYENS

ERYAS

Page 31: Tools for multiple sequence alignment

Local sequence alignment

Find common motif in sequences; ignore the rest

E-YENS

ERYENS

ERYA-S

Page 32: Tools for multiple sequence alignment

Local sequence alignment

Find common motif in sequences; ignore the rest – Local alignment

E-YENS

ERYENS

ERYA-S
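
The simplest instance of "find the common part, ignore the rest" is pairwise local alignment. The sketch below uses the standard Smith-Waterman recurrence, which the slides do not name; the identity scoring and the gap penalty are hypothetical choices.

```python
def local_alignment_score(x, y, s, g):
    """Best score of any pair of substrings of x and y (linear gap penalty g)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0]
        for j in range(1, len(y) + 1):
            cell = max(0,                                     # start a new local alignment
                       prev[j - 1] + s(x[i - 1], y[j - 1]),   # extend with a residue pair
                       prev[j] - g,                           # gap in y
                       cur[j - 1] - g)                        # gap in x
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best

def identity(a, b):
    # Hypothetical toy scoring: +1 for identical residues, -1 otherwise.
    return 1 if a == b else -1

print(local_alignment_score("EYENS", "ERYENS", identity, g=2))
```

The only change from global alignment is the extra option 0, which lets an alignment start anywhere, and taking the maximum over all cells, which lets it end anywhere.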

Page 33: Tools for multiple sequence alignment

Gibbs Motif Sampler

Local multiple alignment without gaps:

E.g. Gibbs sampling

C.E. Lawrence et al. (1993, Science)

Page 34: Tools for multiple sequence alignment

Traditional alignment approaches:

Either global or local methods!

Page 35: Tools for multiple sequence alignment

New question: sequence families with multiple local similarities

Neither local nor global methods applicable

Page 36: Tools for multiple sequence alignment

New question: sequence families with multiple local similarities

Alignment possible if order conserved

Page 37: Tools for multiple sequence alignment

The DIALIGN approach

Morgenstern, Dress, Werner (1996, Proc. Natl. Acad. Sci.)

Combination of global and local methods

Assemble multiple alignment from gap-free local pairwise alignments (“fragments”)
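
To illustrate what a gap-free pairwise fragment is, the sketch below lists maximal runs of identical residues along each diagonal of two sequences. This is a strong simplification and an assumption of this example: the actual program also accepts fragments with mismatches and weights them statistically, which is not reproduced here.

```python
def exact_fragments(x, y, min_len=4):
    """Return (start_x, start_y, length) for maximal exact matches on each diagonal."""
    fragments = []
    # A diagonal is the set of position pairs (i, j) with j - i == offset.
    for offset in range(-(len(x) - 1), len(y)):
        i, j = max(0, -offset), max(0, offset)
        run = 0
        while i < len(x) and j < len(y):
            if x[i] == y[j]:
                run += 1
            else:
                if run >= min_len:
                    fragments.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:  # run reaching the end of the diagonal
            fragments.append((i - run, j - run, run))
    return fragments

print(exact_fragments("ccTTGAcc", "gTTGAg"))
```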

Page 38: Tools for multiple sequence alignment

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Page 39: Tools for multiple sequence alignment

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Page 40: Tools for multiple sequence alignment

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Page 41: Tools for multiple sequence alignment

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Page 42: Tools for multiple sequence alignment

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Page 43: Tools for multiple sequence alignment

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Page 44: Tools for multiple sequence alignment

The DIALIGN approach

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Page 45: Tools for multiple sequence alignment

The DIALIGN approach

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaa--gagtatcacccctgaattgaataa

Page 46: Tools for multiple sequence alignment

The DIALIGN approach

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaa--gagtatcacc----------cctgaattgaataa

Page 47: Tools for multiple sequence alignment

The DIALIGN approach

atc------taatagttaaactcccccgtgc-ttag

cagtgcgtgtattactaac----------gg-ttcaatcgcg

caaa--gagtatcacc----------cctgaattgaataa

Page 48: Tools for multiple sequence alignment

The DIALIGN approach

atc------taatagttaaactcccccgtgc-ttag

cagtgcgtgtattactaac----------gg-ttcaatcgcg

caaa--gagtatcacc----------cctgaattgaataa

Consistency!
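
Consistency means that all chosen fragments fit into one alignment: no residue is used twice and no two fragments cross. For fragments between the same two sequences this can be checked as sketched below. This is a simplified, slightly stricter test (it also rejects overlapping fragments on the same diagonal); the multi-sequence case needs additional bookkeeping that is not shown.

```python
def is_consistent_chain(fragments):
    """fragments: (start_x, start_y, length) tuples between the same two sequences."""
    ordered = sorted(fragments)  # sort by start position in the first sequence
    for (x1, y1, n1), (x2, y2, _) in zip(ordered, ordered[1:]):
        # The next fragment must start after the previous one ends in BOTH sequences.
        if x1 + n1 > x2 or y1 + n1 > y2:
            return False
    return True

print(is_consistent_chain([(0, 0, 3), (5, 4, 2)]))   # same order in both sequences
print(is_consistent_chain([(0, 5, 3), (5, 0, 2)]))   # fragments cross
```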

Page 49: Tools for multiple sequence alignment

The DIALIGN approach

atc------TAATAGTTAaactccccCGTGC-TTag

cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg

caaa--GAGTATCAcc----------CCTGaaTTGAATaa

Page 50: Tools for multiple sequence alignment

The DIALIGN approach

Advantages of segment-based approach:

Program can produce global and local alignments!

Sequence families alignable that cannot be aligned with standard methods

Page 51: Tools for multiple sequence alignment

T-COFFEE

C. Notredame, D. Higgins, J. Heringa (2000, J. Mol. Biol.)

Combination of global and local methods

Page 52: Tools for multiple sequence alignment

T-COFFEE

SeqA GARFIELD THE LAST FAT CAT

SeqB GARFIELD THE FAST CAT

SeqC GARFIELD THE VERY FAST CAT

SeqD THE FAT CAT

Page 53: Tools for multiple sequence alignment

T-COFFEE

SeqA GARFIELD THE LAST FAT CAT

SeqB GARFIELD THE FAST CAT

SeqC GARFIELD THE VERY FAST CAT

SeqD THE FAT CAT

Page 54: Tools for multiple sequence alignment

T-COFFEE

SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT

Pairwise Alignments

SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---

SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT

SeqA GARFIELD THE LAST FAT CAT
SeqD ---------THE ---- FAT CAT

SeqB GARFIELD THE ---- FAST CAT
SeqC GARFIELD THE VERY FAST CAT

SeqB GARFIELD THE FAST CAT
SeqD ---------THE FA-T CAT

SeqC GARFIELD THE VERY FAST CAT
SeqD ---------THE ---- FA-T CAT

Progressive Alignment

SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD ---------THE ---- FA-T CAT

Page 55: Tools for multiple sequence alignment

Mixing Heterogeneous Data With T-Coffee

[Figure: inputs labelled Local Alignment, Global Alignment, Multiple Alignment, Structural, Specialist; output labelled Multiple Sequence Alignment]

Page 56: Tools for multiple sequence alignment
Page 57: Tools for multiple sequence alignment

T-COFFEE

T-COFFEE

Idea:

1. Build library of pairwise alignments

2. Alignments of seq i with seq j and of seq j with seq k support the alignment of seq i with seq k.
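
The two steps can be sketched as a small "library extension": every pair of residues aligned in some pairwise alignment gets a weight, and a pair (i, k) gains extra weight whenever both residues are aligned to the same residue of a third sequence j. Taking the minimum of the two supporting weights is an assumption of this sketch, as are the data layout and the toy weights.

```python
from collections import defaultdict

def extend_library(library):
    """library maps frozenset({(seq, pos), (seq, pos)}) to a weight."""
    neighbours = defaultdict(dict)
    for pair, weight in library.items():
        a, b = tuple(pair)
        neighbours[a][b] = weight
        neighbours[b][a] = weight

    extended = dict(library)
    # Every residue acts as the middle residue of all pairs of its neighbours.
    for linked in neighbours.values():
        for a in linked:
            for c in linked:
                if a < c and a[0] != c[0]:  # each pair once, different sequences only
                    key = frozenset((a, c))
                    extended[key] = extended.get(key, 0) + min(linked[a], linked[c])
    return extended

A1, B1, C1 = ("SeqA", 1), ("SeqB", 1), ("SeqC", 1)
library = {frozenset((A1, B1)): 88, frozenset((B1, C1)): 77}  # invented weights
extended = extend_library(library)
print(extended[frozenset((A1, C1))])  # support for A1-C1 through B1
```

Residue pairs that are supported by many third sequences end up with high weights, and these weights then steer the progressive alignment.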

Page 58: Tools for multiple sequence alignment

T-COFFEE

T-COFFEE

Less sensitive to spurious pairwise similarities

Can handle local homologies better than CLUSTAL

Page 59: Tools for multiple sequence alignment

Evaluation of multi-alignment methods

Alignment evaluation by comparison to trusted benchmark alignments.

“True” alignment known by information about structure or evolution.
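
One common way to compare a program's output with a benchmark alignment is a sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The sketch below is an assumption about how such a score can be computed, not the exact procedure used in this evaluation.

```python
def aligned_pairs(alignment):
    """Set of residue pairs ((row, index), (row, index)) placed in the same column."""
    counters = [0] * len(alignment)   # residues seen so far in each row
    pairs = set()
    for column in zip(*alignment):
        present = []
        for row, symbol in enumerate(column):
            if symbol != "-":
                present.append((row, counters[row]))
                counters[row] += 1
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                pairs.add((present[a], present[b]))
    return pairs

def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(test)) / len(ref)

reference = ["TYI-MREAQYE", "TCIVMREA-YE"]
test = ["TYI-MREAQYE", "TCIV-MREAYE"]
print(sum_of_pairs_score(test, reference))
```

Gap placement does not enter the score directly; only which residues end up in the same column matters.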

Page 60: Tools for multiple sequence alignment

Evaluation of multi-alignment methods

BAliBASE reference alignments

1aboA  1 .NLFVALYDfvasgdntlsitkGEKLRVLgynhn..............gE
1ycsB  1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede............deiE
1pht   1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG
1ihvA  1 .NFRVYYRDsrd......pvwkGPAKLLWkg.................eG
1vie   1 .drvrkksga.........awqGQIVGWYctnlt.............peG

1aboA 36 WCEAQt..kngqGWVPSNYITPVN......
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......
1pht  51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd.....
1vie  28 YAVESeahpgsvQIYPVAALERIN......

Key: alpha helix RED, beta strand GREEN, core blocks UNDERSCORE

Page 61: Tools for multiple sequence alignment

Result: DIALIGN best method for distantly related sequences, T-Coffee best for globally related proteins

Page 62: Tools for multiple sequence alignment

Evaluation of multi-alignment methods

Conclusion: no single best multi alignment program!

Advice: try different methods!

Page 63: Tools for multiple sequence alignment

Tools for phylogeny reconstruction

Two approaches covered in this course:

Distance methods, e.g. Neighbour-Joining

Maximum Likelihood

Other important methods (not covered in this course):

Maximum parsimony

Bayesian approaches

Page 64: Tools for multiple sequence alignment

Tools for phylogeny reconstruction

Phylogenetic trees:

rooted trees

unrooted trees

Many methods produce unrooted trees: find root using outgroup!

Page 65: Tools for multiple sequence alignment

Biological question: Are sponges mono-/paraphyletic?

Phylogenetic Reconstruction: An Example

Organisms of interest: sponges

Page 66: Tools for multiple sequence alignment

Build Dataset

Dataset

Query Sequence

DNA/protein sequence from sponge gene

Search for homologs using e.g. BLAST

Hits from search: “putative” homologs

Page 67: Tools for multiple sequence alignment

Sequence alignment

Dataset

Sequence Alignment

Hits from search: “putative” homologs

Use alignment tools to bring sequences into relation:
- ClustalW
- T-Coffee
- DIALIGN
- ... many more

Page 68: Tools for multiple sequence alignment

Alignment

Phylogenetic tree

Phylogeny methods:
Distance-based:
- NJ
- UPGMA
Parsimony:
- Max. Parsimony (PHYLIP/PAUP)
Statistical:
- Max. Likelihood (PhyML)
- Bayesian Inf. (MrBayes)

Estimate Phylogeny

Page 69: Tools for multiple sequence alignment

Interpret results

Hypothesis: Sponges are monophyletic

Page 70: Tools for multiple sequence alignment

Tools for phylogeny reconstruction

Distance methods:

For N sequences S1, …, SN: calculate distance d(i,j) for any two sequences Si and Sj

Goal: find tree that represents all distances d(i,j) as closely as possible

To calculate distances d(i,j): construct multiple alignment of input sequences, consider substitutions implied by alignment
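
The simplest distance that can be read off an alignment is the fraction of compared columns in which two sequences differ (often called the p-distance). Using this uncorrected measure is an assumption of the sketch; in practice a correction for multiple substitutions at the same site is usually applied on top.

```python
def p_distance(row_a, row_b):
    """Fraction of differing positions among columns where neither row has a gap."""
    compared = differing = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # columns with a gap are skipped
        compared += 1
        differing += a != b
    return differing / compared  # assumes at least one gap-free column

def distance_matrix(alignment):
    n = len(alignment)
    return [[p_distance(alignment[i], alignment[j]) for j in range(n)] for i in range(n)]

alignment = ["TYI-MREAQYE",
             "TCIVMREA-YE",
             "-YI-MQEVQQE"]
for row in distance_matrix(alignment):
    print([round(value, 2) for value in row])
```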

Page 71: Tools for multiple sequence alignment

Matrix of pairwise distances d(i,j)

Page 72: Tools for multiple sequence alignment

Find tree that corresponds to distances d(i,j)
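
As an illustration of how a distance method chooses which sequences to join, here is a minimal sketch of the selection step of Neighbour-Joining. It computes the usual criterion Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where r(i) is the sum of all distances from i, and picks the pair with the smallest value. Only this first step is shown; the distance matrix is a toy example.

```python
def neighbour_joining_pair(d):
    """Return the pair (i, j) minimising the Neighbour-Joining criterion, and its Q value."""
    n = len(d)
    totals = [sum(row) for row in d]   # r(i): sum of distances from i to all others
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q

d = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]
print(neighbour_joining_pair(d))
```

Note that the chosen pair is not simply the one with the smallest distance: here sequences 3 and 4 are closest (distance 3), but the criterion selects 0 and 1 because both are far from everything else.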

Page 73: Tools for multiple sequence alignment

Tools for phylogeny reconstruction

Maximum likelihood:

Consider evolution of sequences as random process. Stochastic model assigns probabilities to substitutions.

Consider tree T as hypothesis about observed sequence data D

Search tree with highest likelihood P(D|T)

Page 74: Tools for multiple sequence alignment

Tools for phylogeny reconstruction

Assumptions:

Positions in sequences (columns in alignment) independent of each other

Events on different branches of tree independent of each other

Result: probabilities can be multiplied

Page 75: Tools for multiple sequence alignment

Probability P(D|T) for given residues at internal nodes

Page 76: Tools for multiple sequence alignment
Page 77: Tools for multiple sequence alignment
Page 78: Tools for multiple sequence alignment

Consider all possible residues for internal nodes
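
Summing over all possible residues at the internal nodes can be done efficiently from the leaves upwards. The sketch below does this for DNA under the Jukes-Cantor substitution model with a uniform root distribution; the model, the tree encoding and the branch lengths are assumptions chosen to keep the example small.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of ending in b after branch length t, starting from a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial(node, column):
    """For each residue x at node: probability of the observed data below node."""
    if isinstance(node, str):                     # leaf = name of a sequence
        return {x: 1.0 if column[node] == x else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:                         # internal node = tuple of (child, branch length)
        below = partial(child, column)
        for x in BASES:
            # Branches are independent, so their contributions are multiplied.
            result[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return result

def column_likelihood(tree, column):
    root = partial(tree, column)
    return sum(0.25 * root[x] for x in BASES)

def log_likelihood(tree, alignment):
    """Columns are independent, so log-probabilities are added."""
    names = list(alignment)
    length = len(alignment[names[0]])
    return sum(math.log(column_likelihood(tree, {n: alignment[n][k] for n in names}))
               for k in range(length))

inner = (("A", 0.1), ("B", 0.2))
tree = ((inner, 0.05), ("C", 0.3))
alignment = {"A": "ACGT", "B": "ACGA", "C": "GCGT"}
print(log_likelihood(tree, alignment))

# Sanity check: the probabilities of all 64 possible columns add up to 1.
total = sum(column_likelihood(tree, dict(zip("ABC", col))) for col in product(BASES, repeat=3))
print(total)
```

Searching for the best tree then means repeating this computation for many candidate trees and branch lengths and keeping the one with the highest P(D|T).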

Page 79: Tools for multiple sequence alignment

Testing the reliability of a tree (or parts of it): the bootstrap approach

Bootstrap in general: repeat statistical test after random “re-sampling”, i.e. by drawing new samples with replacement from the observed data.

In phylogeny:

1. Randomly select columns from the alignment, with replacement, to build a resampled alignment of the same length, and repeat tree reconstruction with the same method (e.g. 1000 times)

2. Calculate for every branch: how often is it observed in newly constructed trees?
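
The two steps can be sketched as follows. The tree-building method is passed in as a function; `closest_pair_split` is only a toy stand-in (it reports the most similar pair of rows as the single "branch") so that the example runs without a real phylogeny program.

```python
import random
from collections import Counter

def resample_columns(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[k] for k in picks) for row in alignment]

def bootstrap_support(alignment, build_splits, replicates=1000, seed=0):
    """Fraction of replicates in which each branch (split) is recovered.

    build_splits: any tree-building routine returning the set of branches,
    e.g. as frozensets of row indices, for a given alignment.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        counts.update(build_splits(resample_columns(alignment, rng)))
    return {split: count / replicates for split, count in counts.items()}

def closest_pair_split(alignment):
    """Toy stand-in for tree reconstruction: the most similar pair of rows."""
    n = len(alignment)

    def mismatches(i, j):
        return sum(a != b for a, b in zip(alignment[i], alignment[j]))

    pair = min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: mismatches(*p))
    return {frozenset(pair)}

alignment = ["ACGTACGT", "ACGTACGA", "TTGTCCGA"]
for split, support in bootstrap_support(alignment, closest_pair_split, replicates=200).items():
    print(sorted(split), support)
```

A branch that appears in nearly all replicates is well supported by the data; a branch that appears only in a minority of replicates depends on a few columns and should be interpreted with caution.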

Page 80: Tools for multiple sequence alignment