Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. ·...

18
1 Bioinformatics for Biologists Comparative Protein Analysis: Part I. Phylogenetic Trees and Multiple Sequence Alignments Robert Latek, PhD Sr. Bioinformatics Scientist Whitehead Institute for Biomedical Research WIBR Bioinformatics Course, © Whitehead Institute, October 2003 2 Comparative Protein Analysis Overview definition Use information regarding a group of sequences to determine the function of an undefined sequence. Extract novel information about a protein, or a series of proteins, through comparisons with other, related sequences. Application definition What are they? What are their functions? Why are they important?

Transcript of Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. ·...

Page 1: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

1

Bioinformatics for Biologists

Comparative Protein Analysis: Part I.Phylogenetic Trees and Multiple

Sequence AlignmentsRobert Latek, PhD

Sr. Bioinformatics ScientistWhitehead Institute for Biomedical Research

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 2

Comparative Protein AnalysisOverview definition

Use information regarding a group of sequences to determine thefunction of an undefined sequence.

Extract novel information about a protein, or a series of proteins,through comparisons with other, related sequences.

Application definitionWhat are they?What are their functions?Why are they important?

Page 2: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

2

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 3

Syllabus• Comparative Protein Analysis• Phylogenetic Tree Techniques and

Application• Multiple Sequence Alignment Techniques

and Application• Demonstration - Putting Trees and MSAs to

Work

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 4

Comparative Protein Analysis• Identify proteins within an organism that are

related to each other and across different species• Generate an evolutionary history of related genes• Locate insertions, deletions, and substitutions that

have occurred during evolution

CREATE CREASE -RELAPSE

GREASER

Page 3: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

3

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 5

Homology• Homology: conserved sequences arising from a

common ancestor– Orthologs: homologous genes that share a common

ancestor in the absence of any gene duplication(speciation)

– Paralogs: genes related through gene duplication (onegene is a copy of another)

• Similarity: genes that share common sequencesbut are not necessarily related

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 6

Syllabus

• Comparative Protein Analysis• Phylogenetic Tree Techniques and

Application• Multiple Sequence Alignments Techniques

and Application• Demonstration - Putting Trees and MSAs to

Work

Page 4: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

4

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 7

Phylogenetic Trees• A graph representing the evolutionary

history of a sequence• Relationship of one sequence to other

sequences• Dissect the order of appearance of

insertions, deletions, and mutations• Predict function, observe epidemiology,

analyzing changes in viral strains

ABC

D

SimpleTree

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 8

Tree ShapesRooted Un-rooted

Branches intersect at NodesLeaves are the topmost branches

A

B

C

D

A

B

C

D

A

B

C

D

Page 5: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

5

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 9

Number of Possible TreesLeaves

2345678910

Rooted Trees131510595410,395135,1352,027,02534,459,425

Un-rooted Trees1131510595410,395135,1352,027,025

(Li, 1997)

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 10

Tree Characteristics• Tree Properties

– Clade: all the descendants of a common ancestorrepresented by a node

– Distance: number of changes that have takenplace along a branch

• Tree Types– Cladogram: shows the branching order of nodes

– Phylogram: shows branching order and distances

A

B

C

D

.035

.009

.057

.044

.012

.016

Phylogram

Page 6: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

6

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 11

Tree Building Methods

• Maximum Parsimony

• Distance Methods– UPGMA– Neighbor Joining

• Maximum Likelihood

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 12

Maximum Parsimony

• Find the tree that changes one sequence into all of the others by theleast number of steps

• Only informative sites are analyzed (no gaps or conserved positions)• Can be misleading when rates of change vary in different tree branches

Tree I Tree II Tree III

Site 5

1AAAA

2AGGG

3GCAA

Site

OneTwo

ThreeFour

Gen

e

Informative Trees

(Select Tree I)

Alignment

(Li, 1991)

4ACTG

5GGAA

6TTTT

G

G

G

A

A

A

A

A

G

A

A

G

G

A

G

A

A

G1

2

3

4

1

3

2

4

1

4

2

3

Page 7: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

7

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 13

Distance Methods

• Distance is expressed as the fraction of sitesthat differ between two sequences in analignment

• Sequences with the smallest number ofchanges (shortest distance) are “relatedtaxa”

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 14

Distance Methods - UPGMA• UPGMA (Unweighted Pair-Group Method with

Arithmetic mean)– Sequentially find pair of taxa with smallest distance

between them, and define branching as midpoint of two– Assumes the tree is additive and that rate of change is

constant in all of the branches

DAB2 D(AB)C

2

D(ABC)D2

A

B

C

D

A

B

A

B

C

Page 8: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

8

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 15

Distance Methods - NJ• Neighbor-Joining (NJ): useful when there are different

rates of evolution within a tree– Each possible pair-wise alignment is examined. Calculate distance

from each sequence to every other sequence– Choose the pair with the lowest distance value and join them to

produce the minimal length tree– Update distance matrix where joined node is substituted for two

original taxa and then repeat process

AB

C

DF

G

H A

D

B C E

F

GH

2 11

A

D

B C

E

F

GH

2 1

3

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 16

Maximum Likelihood

• Best accounts for variation in sequences• Establish a probabilistic model with

multiple solutions and determine which ismost likely

• All possible trees are considered, therefore,only suitable for small number of sequences– Maximizes probability of finding optimal tree

Page 9: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

9

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 17

Tree Reliability• Probability that the members of a clade are always

members of that clade• Sample by Bootstrapping

– Random sites of an alignment are randomly sampled soas to create a dataset the same size as the original. Thesame analysis as applied to the original data set isperformed on the bootstrap dataset

– Construct a consensus bootstrap tree and compare tothe original tree

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 18

Which Method to Use?Is therestrong

sequencesimilarity?

Is there clearlyrecognizable

sequencesimilarity?

MaximumLikelihood

DistanceMethods

MaximumParsimony

yes

yes

no

no

(Mount, 2001)

Page 10: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

10

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 19

Syllabus

• Comparative Protein Analysis• Phylogenetic Tree Techniques and

Application• Multiple Sequence Alignments

Techniques and Application• Demonstration - Putting Trees and MSAs to

Work

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 20

Multiple Sequence Alignments• Place residues in columns that are derived

from a common ancestral residue• MSA can reveal sequence patterns

– Demonstration of homology between >2sequences

– Identification of functionally important sites– Protein function prediction– Structure prediction– Search for weak but significant similarities in

databases– Design PCR primers for related gene

identification– Genome sequencing: contig assembly

CREAT--E-CREAS--E-

-RELAPSE-GREAS--ER

CREATECREASE

GREASER

RELAPSE

123456789

SeqASeqBSeqCSeqD

Page 11: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

11

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 21

Multiple Sequence Alignment

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 22

Global vs. Local Alignments

• Global– Search for alignments, matching over entire

sequences• Local

– Examine regions of sequence for conservedsegments

• Both Consider: Matches, Mismatches, Gaps

Page 12: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

12

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 23

Approaches

• Optimal Global Alignments– Dynamic programming

• Global Progressive Alignments• Global Iterative Alignments• Local alignments

– Profiles, Blocks, Patterns

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 24

Optimal Global Alignments• Dynamic programming is used for aligning a

small number of sequences• Build matrices with every possible combination

and search for optimal solution– Optimal in the mathematical sense

• Problem gets large quickly– Length raised to number of sequences– Align 10 sequences of 100 aa length 10010

Page 13: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

13

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 25

Global Progressive Alignment• A heuristic approach that utilizes

phylogenetic information to assist inrouting the alignment(clustalw/clustalx)

• Feng & Doolittle1987, Higgins andSharp 1988

• Most alike sequences are alignedtogether in order of their similarity(tree-based), a consensus isdetermined and then aligned to nextmost similar sequence

Seq1 Seq2 Seq3 Seq4VMR VMK GMK GMV

VMRVMKGMKGMV

VMRVMK = VMR/K

VMR/KGMK =V/G M R/K

V/G M R/KG M V =V/G M V/R/K

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 26

Iterative Multiple Alignment• “Repeatedly re-align subgroups of

sequences into a global alignmentto improve alignment score”(Mount, 2001)

• Start with a progressive alignmentand tree

• Recalculate pair-wise scoresduring progressive alignment, usenew scores to rebuild the tree,which is used to improvealignments

Initial ProgressiveAlignment

Build Tree

Weight Based On Subgroup Alignments

Iterate MSA

Page 14: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

14

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 27

Global Alignment Errors

• Dependence of MSA on the initial pair-wisealignments

• Improper scoring when aligning a set ofsequences that have non-overlappingsegments

Seq1Seq2

Seq3???

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 28

Localized Alignments• Blocks

– Conserved region without gaps• Patterns

– a deterministic syntax that describes multiplecombinations of possible residues within a proteinstring

• Profiles– probabilistic generalizations that assign to every

segment position, a probability that each of the 20 aawill occur. Includes scores for substitutions and gapsfor the conserved region (consensus)

Page 15: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

15

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 29

Block Analysis

• Represent a conserved region within a MSA• Contain matches, mismatches, but no gaps• Serve as anchors to assist in aligning

sequences by aligning individual segmentsSeq1Seq2Seq3Seq4Seq5Seq6

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 30

Patterns (Motifs)

• Patterns are a string ofnon-contiguous motifs– Remove low

complexity regions– i.e. Docking site of a

kinase to a receptor

2AVLI

X3RKH

4YFWH

X X XX X X X X X

Motifs

Pattern

Page 16: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

16

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 31

Profile Analysis• Perform global MSA on group of sequences• Move highly conserved regions to smaller

MSAs• Generate scoring table with log odds scores

– Each column is independent– Average Method: profile matrix values are weighted

by the proportion of each amino acid in each columnof MSA

– Evolutionary Method: calculate the evolutionarydistance (Dayhoff model) required to generate theobserved amino acid distribution

CREATEM K D V S DL E Y R I P

%

Position

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 32

MSA and Tree Relationship• “The optimal alignment of several sequences can be thought

of as minimizing the number of mutational steps in anevolutionary tree for which the sequences are the leaves”(Mount, 2001)

(Mount, 2001)

N • F L S seqA

N • F - S seqB

N K Y L S seqC

N • Y L S seqD

Y to F

-L

-L

+K

Page 17: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

17

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 33

Pointers• When to use each method?

– Related sequences = global alignments– Divergent sequences = local alignments– Use together to build the ‘biologically relevant’

alignment• Applications

– MSAs: ClustalX, Jalview, Belvu– Annotation:

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 34

File Formats• MSF

– http://jura.wi.mit.edu/bio/education/bioinfo/session3/seq.msf

• ALN– http://jura.wi.mit.edu/bio/education/bioinfo/session3/seq.aln

• PIR– http://jura.wi.mit.edu/bio/education/bioinfo/session3/seq.pir

• DND– http://jura.wi.mit.edu/bio/education/bioinfo/session3/seq.dnd

• PH– http://jura.wi.mit.edu/bio/education/bioinfo/session3/seq.ph

Page 18: Bioinformatics for Biologistsbarc.wi.mit.edu/education/bioinfo/lecture7-bw.pdf · 2009. 11. 5. · Bioinformatics for Biologists Comparative Protein Analysis: Part I. ... Global vs.

18

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 35

Demonstrations• Multiple Sequence Alignments

– Clustal (web-based)• http://pir.georgetown.edu/pirwww/search/multaln.html

– ClustalX (local)– Jalview

• Tree Building– PAUP (UNIX-based)– ClustalX– Phylodendron

• http://iubio.bio.indiana.edu/treeapp/treeprint-form.html

WIBR Bioinformatics Course, © Whitehead Institute, October 2003 36

References• Bioinformatics: Sequence and genome Analysis.

David W. Mount. CSHL Press, 2001.• Bioinformatics: A Practical Guide to the Analysis

of Genes and Proteins. Andreas D. Baxevanis andB.F. Francis Ouellete. Wiley Interscience, 2001.

• Bioinformatics: Sequence, structure, anddatabanks. Des Higgins and Willie Taylor. OxfordUniversity Press, 2000.