1 Summary on similarity search or Why do we care about far homologies ? A protein from a new...

Post on 31-Dec-2015


Summary on similarity search, or:

Why do we care about distant homologies?

• A protein from a new pathogenic bacterium. We have no idea what it does.

• A protein from a model organism. We know what it does, but we do not know which protein does the same in humans.

• A protein related to a disease. We have no idea what it does in relation to the disease.

retinol-binding protein

odorant-binding protein

apolipoprotein D

RBP4 and obesity


Scoring matrices let you focus on the big (or small) picture

[Figure: retinol-binding protein compared with retinol-binding protein using PAM250, PAM30, BLOSUM45, and BLOSUM80]

PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM


Phylogenetic trees

Phylogeny is the inference of evolutionary relationships. Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are mainly used for phylogenetic analyses.

One tree of life: A sketch Darwin made soon after returning from his voyage on HMS Beagle (1831–36) showed his thinking about the diversification of species from a single stock (see Figure, overleaf). This branching, extended by the concept of common descent,

Phylogeny in Greek = the origin of the tribe

[Figures: trees of life by Haeckel (1879) and Pace (2001)]

Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data.

[Figure: two trees of human, chimpanzee, gorilla, and orangutan]

Pre-molecular analysis: the great apes (chimpanzee, gorilla, and orangutan) are separate from humans.

Molecular analysis: the chimpanzee is more closely related to humans than the gorilla is.

What can we learn from a phylogenetic tree?

• Was the extinct quagga more like a zebra or a horse?

Determine the closest relatives of an organism in which we are interested

Which species are closest to humans?

[Figure: two alternative trees of human, chimpanzee, gorilla, and orangutan]

Human Evolution

[Figure: modern man and Neanderthals]

Example: Metagenomics

A new field in genomics that aims to study the genomes recovered from environmental samples.

A powerful tool to access the rich biodiversity of native environmental samples

Helps to find the relationships between species and to identify new species

10^6 cells/ml seawater

10^7 virus particles/ml seawater

>99% uncultivated microbes

How can we discover new species in the ocean?

Relationships can be represented by a phylogenetic tree or dendrogram

[Figure: tree with nodes labeled A, B, C, D, E, F]

Phylogenetic Tree Terminology

• Graph composed of nodes & branches

• Each branch connects two adjacent nodes
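The two definitions above can be made concrete with a small sketch. The topology below (leaves A to D, internal nodes E and F, root R) is hypothetical and chosen only for illustration; it is not read off the slide figure.

```python
# Hypothetical rooted tree: leaves A-D, internal nodes E and F, root R.
# Each branch connects exactly two adjacent nodes, written (parent, child).
branches = [
    ("R", "E"), ("R", "F"),
    ("E", "A"), ("E", "B"),
    ("F", "C"), ("F", "D"),
]

def nodes_of(branches):
    """Return the set of all nodes touched by at least one branch."""
    return {node for branch in branches for node in branch}

def leaves_of(branches):
    """Return the nodes that never appear as a parent (the tips of the tree)."""
    parents = {parent for parent, _ in branches}
    return nodes_of(branches) - parents

nodes = nodes_of(branches)
leaves = leaves_of(branches)
```

In this representation a tree with n nodes always has n - 1 branches, which is a quick sanity check on any hand-written topology.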

[Figure: tree with nodes labeled A, B, C, D, E, F, and R]

Rooted tree and un-rooted tree

[Figure: a rooted tree and an un-rooted tree of human, chimp, gorilla, and chicken]

Rooted vs. unrooted trees

[Figure: a rooted tree and an unrooted tree with leaves 1, 2, and 3]

How can we build a tree with molecular data?

- Trees based on DNA sequence (rRNA)

- Trees based on protein sequences

Basic algorithm for constructing a rooted tree

Unweighted Pair Group Method using Arithmetic Averages (UPGMA)

Assumption: divergence of sequences occurs at a constant rate, so the distance from every sequence to the root is equal.

Sequence a ACGCGTTGGGCGATGGCAAC

Sequence b ACGCGTTGGGCGACGGTAAT

Sequence c ACGCATTGAATGATGATAAT

Sequence d ACACATTGAGTGTGATAATA

Moving from Similarity to Distance

Sequences

Sequence a ACGCGTTGGGCGATGGCAAC

Sequence b ACACATTGAGTGTGATCAAC

Sequence c ACACATTGAGTGAGGACAAC

Sequence d ACGCGTTGGGCGACGGTAAT

Distances *

Dab = 8, Dac = 7, Dad = 5, Dbc = 3, Dbd = 9, Dcd = 8

Distance Table

     a  b  c  d
a    0  8  7  5
b    8  0  3  9
c    7  3  0  8
d    5  9  8  0

* Can be calculated using different distance metrics
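One simple metric is the number of mismatched positions between two aligned sequences (the Hamming distance). The sketch below applies it to the four sequences of the distance-table slide. The choice of metric is an assumption: for these sequences it reproduces some entries of the table (Dab = 8, Dbc = 3, Dcd = 8) but not all of them, so the slide's table was presumably built with a different metric.

```python
from itertools import combinations

# The four aligned sequences from the distance-table slide.
sequences = {
    "a": "ACGCGTTGGGCGATGGCAAC",
    "b": "ACACATTGAGTGTGATCAAC",
    "c": "ACACATTGAGTGAGGACAAC",
    "d": "ACGCGTTGGGCGACGGTAAT",
}

def hamming(s, t):
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(s, t) if x != y)

# One entry per unordered pair, e.g. distances[("a", "b")].
distances = {
    (i, j): hamming(sequences[i], sequences[j])
    for i, j in combinations(sorted(sequences), 2)
}
```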


Constructing a tree starting from a star model

[Figure: star tree with leaves a, b, c, and d]

Step 1: Choose the nodes with the shortest distance and fuse them. In the distance table the shortest distance is Dbc = 3, so c and b are fused.
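Step 1 amounts to finding the smallest entry of the distance table. A minimal sketch, assuming the table is stored as a dictionary keyed by unordered pairs (a representation chosen here for illustration):

```python
# Distance table from the slides, stored once per unordered pair.
D = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

def closest_pair(table):
    """Return the pair of nodes separated by the smallest distance."""
    return min(table, key=table.get)

pair = closest_pair(D)
```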

Step 2: Recalculate the distances from the remaining sequences (a and d) to the new node (e), and remove the fused nodes (c and b) from the table.

D(ea) = (D(ac) + D(ab) - D(cb)) / 2 = (7 + 8 - 3) / 2 = 6

D(ed) = (D(dc) + D(db) - D(cb)) / 2 = (8 + 9 - 3) / 2 = 7

     a  d  e
a    0  5  6
d    5  0  7
e    6  7  0
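The Step 2 update can be sketched as a small function. It follows the formula exactly as written on the slide; the helper names `dist` and `fuse` and the dictionary layout are assumptions made for this illustration. Averaging-based UPGMA implementations typically use the mean of the two old distances instead.

```python
# Distance table from the slides, stored once per unordered pair.
D = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

def dist(table, x, y):
    """Look up a distance regardless of the order of the two labels."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]

def fuse(table, i, j, new):
    """Fuse i and j into `new` with D(new,k) = (D(i,k) + D(j,k) - D(i,j)) / 2."""
    others = sorted({n for pair in table for n in pair} - {i, j})
    # Keep only the entries that involve neither fused node.
    reduced = {p: d for p, d in table.items() if i not in p and j not in p}
    for k in others:
        reduced[(k, new)] = (
            dist(table, i, k) + dist(table, j, k) - dist(table, i, j)
        ) / 2
    return reduced

reduced = fuse(D, "c", "b", "e")
```

The result holds the three entries of the reduced table: a-d stays at 5, and the new distances are a-e = 6 and d-e = 7.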

Step 3: In order to get a tree, un-fuse c and b by calculating their distances to the new node (e).

[Figure: star of a, d, and e, with c and b attached to e by branches Dce and Dbe]

Note: the distances Dce and Dbe are calculated assuming a constant rate of evolution.
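The slides do not spell out how the two branch lengths are obtained. A minimal sketch, assuming that a constant rate places the two fused nodes at the same distance from the new node, so that each branch gets half of the distance between them:

```python
def split_equally(distance):
    """Assumed constant-rate rule: both fused nodes are equally far from the new node."""
    half = distance / 2
    return half, half

D_bc = 3  # distance between c and b in the distance table
D_ce, D_be = split_equally(D_bc)
```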

Next…

We want to fuse the next closest nodes. In the reduced table the shortest distance is Dad = 5, so a and d are fused into a new node (f).

Finally

We need to calculate the distance between e and f:

D(ef) = (D(ea) + D(ed) - D(ad)) / 2 = (6 + 7 - 5) / 2 = 4

     f  e
f    0  4
e    4  0

[Figure: tree with a and d joined at f, and c and b joined at e]

From a star to a tree

[Figure: the star of a, b, c, and d resolved into a tree with internal nodes e and f]
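The whole walk-through (fuse the closest pair, recalculate, repeat) can be put into one loop. This is a sketch under stated assumptions rather than a reference implementation: it uses the distance update exactly as written on the slides, names each fused node by the tuple of its two children instead of e and f, and the function names are hypothetical.

```python
# Distance table from the slides, stored once per unordered pair.
D = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

def dist(table, x, y):
    """Look up a distance regardless of the order of the two labels."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]

def build_tree(table):
    """Fuse the closest pair until two nodes remain; return (tree, last distance)."""
    table = dict(table)
    while len(table) > 1:
        i, j = min(table, key=table.get)          # Step 1: closest pair
        d_ij = table[(i, j)]
        others = {n for pair in table for n in pair} - {i, j}
        new = (i, j)                              # fused node, named by its children
        reduced = {p: d for p, d in table.items() if i not in p and j not in p}
        for k in others:                          # Step 2: distances to the new node
            reduced[(k, new)] = (dist(table, i, k) + dist(table, j, k) - d_ij) / 2
        table = reduced
    ((left, right), last), = table.items()        # the final two nodes and their distance
    return (left, right), last

tree, d_ef = build_tree(D)
```

For the slide's table this groups b with c and a with d, and the distance between the two internal nodes comes out as 4, matching D(ef) above.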

IMPORTANT!

• Usually we do not assume a constant mutation rate, so in order to choose the nodes to fuse we have to calculate the relative distance of each node to all other nodes.

Neighbor Joining (NJ) is an algorithm suited to cases where the rate of evolution varies.
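The slides do not give the formula for this relative distance. One standard choice is the Q criterion of Neighbor Joining, Q(i,j) = (n - 2) D(i,j) - R(i) - R(j), where R(x) is the sum of the distances from x to all other nodes and the pair with the lowest Q is fused. The sketch below applies it to the distance table used earlier; the helper names are assumptions for illustration.

```python
from itertools import combinations

# Distance table from the slides, stored once per unordered pair.
D = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

def dist(table, x, y):
    """Look up a distance regardless of the order of the two labels."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]

def q_matrix(table):
    """Neighbor Joining criterion: Q(i,j) = (n - 2) * D(i,j) - R(i) - R(j)."""
    nodes = sorted({n for pair in table for n in pair})
    n = len(nodes)
    # R(x): sum of distances from x to every other node.
    total = {x: sum(dist(table, x, y) for y in nodes if y != x) for x in nodes}
    return {
        (i, j): (n - 2) * dist(table, i, j) - total[i] - total[j]
        for i, j in combinations(nodes, 2)
    }

Q = q_matrix(D)
```

For this table the pairs (a, d) and (b, c) tie for the lowest Q value, which agrees with the two pairs fused in the walk-through above.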

Human Evolution Tree

[Figure: Neighbor Joining tree and UPGMA tree]

The downside of phylogenetic trees

- Using different regions from the same alignment may produce different trees.

Problems with phylogenetic trees

[Figure: five trees of the same seven taxa (Bacillus, E.coli, Pseudomonas, Salmonella, Aeromonas, Lechevaliera, Burkholderias), with leaves numbered 1 to 7 in differing orders; scale bar 0.2]

• What to do?

Bootstrapping

A. We create new data sets by sampling N positions with replacement.

B. We generate 100 - 1000 such pseudo-data sets.

C. For each such data set we reconstruct a tree, using the same method.

D. We note the agreement between the tree reconstructed from the pseudo-data set and the original tree.

Note: we do not change the number of sequences!
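Steps A and B can be sketched as follows. The alignment reuses the four sequences from the distance-table slide purely as sample data, and the function name and the fixed seed are assumptions for illustration. Steps C and D (tree reconstruction and comparison) are not shown.

```python
import random

# Sample alignment: the four sequences from the distance-table slide.
alignment = {
    "a": "ACGCGTTGGGCGATGGCAAC",
    "b": "ACACATTGAGTGTGATCAAC",
    "c": "ACACATTGAGTGAGGACAAC",
    "d": "ACGCGTTGGGCGACGGTAAT",
}

def bootstrap_alignment(alignment, rng):
    """Resample N alignment columns with replacement, keeping every sequence."""
    length = len(next(iter(alignment.values())))
    # The same column indices are used for all sequences, so columns stay intact.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }

rng = random.Random(0)  # fixed seed so the sketch is repeatable
pseudo_sets = [bootstrap_alignment(alignment, rng) for _ in range(100)]
```

Each pseudo-data set has the same number of sequences and the same length as the original; only the choice of columns changes.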

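Bootstrapped tree

[Figure: tree of Pseudomonas, Burkholderias, E.coli, Salmonella, Lechevaliera, Aeromonas, and Bacillus with bootstrap values such as 100, 83, and 58 on its branches; one branch is marked "Highly reliable branch" and another "Less reliable branch"; scale bar 0.2]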

Open Questions

• Do DNA and protein sequences from the same gene produce different trees?

• Can different genes have different evolutionary histories?