1 Summary on similarity search or Why do we care about far homologies ? A protein from a new...

Page 1:

Summary on similarity search, or

Why do we care about far homologies?

• A protein from a new pathogenic bacterium: we have no idea what it does.

• A protein from a model organism: we know what it does, but we do not know which protein does the same in human.

• A protein related to a disease: we have no idea what it does in relation to the disease.

Page 2:

retinol-binding protein

odorant-binding protein

apolipoprotein D

Page 3:

RBP4 and obesity

retinol-binding protein

odorant-binding protein

apolipoprotein D

Page 4:

Scoring matrices let you focus on the big (or small) picture

retinol-binding protein

retinol-binding protein

PAM250

PAM30

BLOSUM45

BLOSUM80

Page 5:

PSI-BLAST generates position-specific scoring matrices that are more powerful than PAM or BLOSUM

retinol-binding protein

retinol-binding protein

Page 6:

Phylogenetic trees

Page 7:

Phylogeny is the inference of evolutionary relationships. Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are mainly used for phylogenetic analyses.

One tree of life: A sketch Darwin made soon after returning from his voyage on HMS Beagle (1831–36) showed his thinking about the diversification of species from a single stock (see Figure, overleaf). This branching, extended by the concept of common descent,

Phylogeny in Greek = the origin of the tribe

Page 8:

Haeckel (1879) Pace (2001)

Page 9:

Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data.

Pre-molecular analysis: the great apes (chimpanzee, gorilla and orangutan) are separate from the human.

Molecular analysis: the chimpanzee is more closely related to the human than the gorilla is.

[Figure: two trees of Human, Chimpanzee, Gorilla and Orangutan, one for each analysis]

Page 10:

What can we learn from a phylogenetic tree?

Page 11:

Determine the closest relatives of an organism in which we are interested

• Was the extinct quagga more like a zebra or a horse?

Page 12:

Which species is closest to human?

[Figure: two alternative trees of Human, Chimpanzee, Gorilla and Orangutan]

Page 13:

Human Evolution

Modern Man

Neanderthals

Page 14:

Example: Metagenomics

A new field in genomics that aims to study the genomes recovered from environmental samples.

A powerful tool for accessing the rich biodiversity of native environmental samples.

Helps to find the relationships between species and to identify new species.

Page 15:

10^6 cells/ml seawater

10^7 virus particles/ml seawater

>99% uncultivated microbes

How can we discover new species in the ocean?

Page 16:

Relationships can be represented by a phylogenetic tree or dendrogram

[Figure: tree with nodes labelled A, B, C, D, E and F]

Page 17:

Phylogenetic Tree Terminology

• Graph composed of nodes & branches

• Each branch connects two adjacent nodes

[Figure: tree with nodes labelled A, B, C, D, E, F and R]
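The terminology can be sketched as a small data structure. The topology below is an assumption made for illustration only (A to D as leaves, E and F as internal nodes, R as the root); the slide's figure cannot be recovered from the transcript.

```python
# Each branch connects two adjacent nodes, written here as (parent, child).
# ASSUMPTION: this topology is hypothetical; only the node labels come from the slide.
branches = [
    ("R", "E"), ("R", "F"),
    ("E", "A"), ("E", "B"),
    ("F", "C"), ("F", "D"),
]

# The nodes are every label that appears at either end of a branch.
nodes = sorted({n for branch in branches for n in branch})

# Leaves are nodes that never appear as a parent.
parents = {parent for parent, _ in branches}
leaves = [n for n in nodes if n not in parents]

print(nodes)
print(leaves)
```

In a tree, the number of branches is always one less than the number of nodes, which is an easy sanity check on any edge list like this one.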

Page 18:

Phylogenetic Tree Terminology

Rooted tree

[Figure: rooted tree of Human, Chimp, Chicken and Gorilla]

Un-rooted tree

[Figure: un-rooted tree of Human, Chimp, Chicken and Gorilla]

Page 19:

Rooted vs. unrooted trees

[Figure: the taxa 1, 2 and 3 drawn as a rooted tree and as an unrooted tree]

Page 20:

How can we build a tree with molecular data?

- Trees based on DNA sequence (rRNA)

- Trees based on protein sequences

Page 21:

Basic algorithm for constructing a rooted tree

Unweighted Pair Group Method using Arithmetic Averages (UPGMA)

Assumption: Divergence of sequences is assumed to occur at a constant rate, so the distance to the root is equal for all sequences.

Sequence a ACGCGTTGGGCGATGGCAAC
Sequence b ACGCGTTGGGCGACGGTAAT
Sequence c ACGCATTGAATGATGATAAT
Sequence d ACACATTGAGTGTGATAATA

[Figure: tree with leaves a, b, c and d]

Page 22:

Moving from Similarity to Distance

Sequences

Sequence a ACGCGTTGGGCGATGGCAAC
Sequence b ACACATTGAGTGTGATCAAC
Sequence c ACACATTGAGTGAGGACAAC
Sequence d ACGCGTTGGGCGACGGTAAT

Distances *

Dab = 8
Dac = 7
Dad = 5
Dbc = 3
Dbd = 9
Dcd = 8

Distance Table

     a   b   c   d
a    0   8   7   5
b    8   0   3   9
c    7   3   0   8
d    5   9   8   0

* Can be calculated using different distance metrics
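As an illustration of one such metric, the sketch below counts mismatched positions between aligned sequences. This is only the simplest possible choice: on the slide's sequences it reproduces Dab = 8 and Dbc = 3 but not every value in the table, so the slide's distances presumably come from a different metric. The function name `mismatches` is introduced here for illustration.

```python
from itertools import combinations

# Sequences copied from the slide.
seqs = {
    "a": "ACGCGTTGGGCGATGGCAAC",
    "b": "ACACATTGAGTGTGATCAAC",
    "c": "ACACATTGAGTGAGGACAAC",
    "d": "ACGCGTTGGGCGACGGTAAT",
}

def mismatches(s, t):
    """Count positions at which two equal-length aligned sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s, t))

# Print the mismatch count for every pair of sequences.
for i, j in combinations(seqs, 2):
    print(f"D{i}{j} = {mismatches(seqs[i], seqs[j])}")
```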

Page 23:

Constructing a tree starting from a STAR model

[Figure: star with the four sequences a, d, c and b]

     a   b   c   d
a    0   8   7   5
b    8   0   3   9
c    7   3   0   8
d    5   9   8   0

Step 1: Choose the nodes with the shortest distance and fuse them.
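Step 1 amounts to finding the smallest entry in the distance table. A minimal sketch, storing each pair of the slide's table once in a dictionary (the dictionary layout is a choice made here, not something the slide specifies):

```python
# Distance table from the slide, one entry per unordered pair.
D = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

# The pair with the shortest distance is the one to fuse first.
closest = min(D, key=D.get)
print(closest, D[closest])  # ('b', 'c') 3
```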

Page 24:

Step 2: Recalculate the distance from each of the remaining sequences (a and d) to the new node (e), and remove the fused nodes (c and b) from the table.

D(ea) = (D(ac) + D(ab) - D(cb))/2 = (7 + 8 - 3)/2 = 6

D(ed) = (D(dc) + D(db) - D(cb))/2 = (8 + 9 - 3)/2 = 7

Original table:

     a   b   c   d
a    0   8   7   5
b    8   0   3   9
c    7   3   0   8
d    5   9   8   0

Reduced table:

     a   d   e
a    0   5   6
d    5   0   7
e    6   7   0

[Figure: star with a, d and the new node e, where e replaces c and b]
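The update rule on this slide can be checked with a few lines of Python. The helper name `to_new_node` is hypothetical; the formula and the input distances are the slide's.

```python
def to_new_node(d_ki, d_kj, d_ij):
    """Distance from node k to the node created by fusing i and j,
    using the rule on the slide: (D(ki) + D(kj) - D(ij)) / 2."""
    return (d_ki + d_kj - d_ij) / 2

# Fused pair is (c, b) with D(cb) = 3; the new node is e.
D_ea = to_new_node(7, 8, 3)  # k = a: D(ac) = 7, D(ab) = 8
D_ed = to_new_node(8, 9, 3)  # k = d: D(dc) = 8, D(db) = 9

print(D_ea, D_ed)  # 6.0 7.0
```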

Page 25:

Step 3: In order to get a tree, un-fuse c and b by calculating their distance to the new node (e)

!!! The distances Dce and Dde are calculated assuming constant-rate evolution

     a   d   e
a    0   5   6
d    5   0   7
e    6   7   0

[Figure: nodes a and d, with c and b joined at the new node e; branch labels Dce and Dde]

Page 26:

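Next…

We want to fuse the next closest nodes: a and d (distance 5), which form the new node f.

     a   d   e
a    0   5   6
d    5   0   7
e    6   7   0

[Figure: a and d fused into the new node f; c and b joined at e; branch labels Dce and Dde]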

Page 27:

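Finally

We need to calculate the distance between e and f

D(ef) = (D(ea) + D(ed) - D(ad))/2 = (6 + 7 - 5)/2 = 4

     f   e
f    0   4
e    4   0

[Figure: tree with leaves a, b, c and d and internal nodes e and f; branch labels Daf, Dbf, Dce and Dde]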

Page 28:

From a Star to a tree

[Figure: the star of a, d, c and b resolved into a tree with internal nodes f and e]
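Put together, the walk-through from star to tree can be sketched as a short loop. This follows the fuse-and-update rule used on these slides rather than any particular library's implementation; the helper names `dist` and `fuse` are introduced here for illustration.

```python
def dist(D, x, y):
    """Look up a symmetric distance stored under either key order."""
    return D[(x, y)] if (x, y) in D else D[(y, x)]

def fuse(D, i, j, new):
    """Fuse nodes i and j into `new` and return the reduced table."""
    others = sorted({n for pair in D for n in pair} - {i, j})
    # Keep the distances that involve neither of the fused nodes.
    reduced = {p: d for p, d in D.items() if i not in p and j not in p}
    for k in others:
        reduced[(k, new)] = (dist(D, k, i) + dist(D, k, j) - dist(D, i, j)) / 2
    return reduced

# Distance table from the slides.
D = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

merges = []
for new in ("e", "f"):          # names of the new internal nodes
    i, j = min(D, key=D.get)    # choose the closest pair
    merges.append((new, i, j))
    D = fuse(D, i, j, new)      # recalculate distances to the new node

print(merges)  # [('e', 'b', 'c'), ('f', 'a', 'd')]
print(D)       # {('e', 'f'): 4.0}
```

The loop fuses b and c into e, then a and d into f, and ends with the single distance D(ef) = 4, matching the tables on the preceding slides.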

Page 29:

IMPORTANT !!!

• Usually we don't assume a constant mutation rate, and in order to choose the nodes to fuse we have to calculate the relative distance of each node to all other nodes.

Neighbor Joining (NJ) is an algorithm suited to cases where the rate of evolution varies.
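One way to make "relative distance" concrete is the standard Neighbor Joining selection criterion, Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where r(k) is the sum of the distances from node k to all other nodes. The slide does not spell this formula out, so the sketch below is an illustration applied to the distance table from the earlier slides, not a reconstruction of the slide's own calculation.

```python
# Distance table from the earlier slides, one entry per unordered pair.
D = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

nodes = sorted({n for pair in D for n in pair})
n = len(nodes)

# r[k]: total distance from node k to all other nodes.
r = {k: sum(d for pair, d in D.items() if k in pair) for k in nodes}

# Q corrects each pairwise distance by how far both nodes are from everything else.
# The pair with the lowest Q is joined first.
Q = {(i, j): (n - 2) * d - r[i] - r[j] for (i, j), d in D.items()}

for pair, q in sorted(Q.items(), key=lambda item: item[1]):
    print(pair, q)
```

On this small table the pairs (a, d) and (b, c) tie for the lowest Q value (-32), which are the same two pairs fused in the walk-through.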

Page 30:

Human Evolution Tree

Neighbor Joining

UPGMA

Page 31:

The downside of phylogenetic trees

- Using different regions of the same alignment may produce different trees.

Page 32:

Problems with phylogenetic trees

[Figure: tree of seven taxa (Bacillus, E. coli, Pseudomonas, Salmonella, Aeromonas, Lechevaliera, Burkholderias) with labels 1 to 7 and a 0.2 scale bar]

Page 33:

Problems with phylogenetic trees

[Figure: four alternative trees of the same seven taxa (Bacillus, E. coli, Pseudomonas, Salmonella, Aeromonas, Lechevaliera, Burkholderias), each with a 0.2 scale bar. The labels 1 to 7 appear in a different order in each tree: 1 7 5 3 6 2 4; 1 3 7 5 6 2 4; 1 5 3 7 6 2 4; 3 5 7 1 6 2 4]

Page 34:

Problems with phylogenetic trees

• What to do?

Page 35:

Bootstrapping

A. We create new data sets by sampling N positions (alignment columns) with replacement.

B. We generate 100 - 1000 such pseudo-data sets.

C. For each such data set we reconstruct a tree, using the same method.

D. We note the agreement between the tree reconstructed from the pseudo-data set and the original tree.

Note: we do not change the number of sequences!
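Steps A and B can be sketched in Python as follows. The function name `bootstrap_alignments`, the `seed` parameter and the reuse of the four example sequences from the distance slide are assumptions made for illustration; tree reconstruction and comparison (steps C and D) are left out.

```python
import random

def bootstrap_alignments(alignment, n_replicates, seed=None):
    """Yield pseudo-data sets built by sampling alignment columns with replacement."""
    rng = random.Random(seed)
    n_positions = len(next(iter(alignment.values())))
    for _ in range(n_replicates):
        # Draw N column indices with replacement; the same columns are used
        # for every sequence, so the number of sequences never changes.
        cols = [rng.randrange(n_positions) for _ in range(n_positions)]
        yield {name: "".join(seq[c] for c in cols)
               for name, seq in alignment.items()}

# Example alignment reused from the distance slide.
alignment = {
    "a": "ACGCGTTGGGCGATGGCAAC",
    "b": "ACACATTGAGTGTGATCAAC",
    "c": "ACACATTGAGTGAGGACAAC",
    "d": "ACGCGTTGGGCGACGGTAAT",
}

replicates = list(bootstrap_alignments(alignment, n_replicates=100, seed=0))
print(len(replicates), "pseudo-data sets of length", len(replicates[0]["a"]))
```

Each pseudo-data set would then be passed to the same tree-building method as the original alignment, and the fraction of replicates that recover a given branch gives that branch's bootstrap value.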

Page 36:

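Bootstrapped tree

[Figure: tree of Pseudomonas, Burkholderias, E. coli, Salmonella, Lechevaliera, Aeromonas and Bacillus (labels 1 to 7, scale bar 0.2) with bootstrap values 77, 100, 83 and 58 on its branches; annotations mark a highly reliable branch and a less reliable branch]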

Page 37:

Open Questions

• Do DNA and protein sequences from the same gene produce different trees?

• Can different genes have different evolutionary histories?
