1 Summary on similarity search or Why do we care about far homologies ? A protein from a new...
Summary on similarity search
or
Why do we care about far homologies?
A protein from a new pathogenic bacterium. We have no idea what it does.
A protein from a model organism. We know what it does, but we do not know which protein does the same in humans.
A protein related to a disease. We have no idea what it does in relation to the disease.
retinol-binding protein
odorant-binding protein
apolipoprotein D
RBP4 and obesity
Scoring matrices let you focus on the big (or small) picture
retinol-binding protein
PAM250
PAM30
BLOSUM45
BLOSUM80
PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM
retinol-binding protein
retinol-binding protein
Phylogenetic trees
Phylogeny is the inference of evolutionary relationships. Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are mainly used for phylogenetic analyses.
One tree of life: A sketch Darwin made soon after returning from his voyage on HMS Beagle (1831–36) showed his thinking about the diversification of species from a single stock (see Figure, overleaf). This branching, extended by the concept of common descent,
Phylogeny (from Greek) = the origin of the tribe
Haeckel (1879) Pace (2001)
Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data.
[Figure: two alternative trees of Human, Chimpanzee, Gorilla and Orangutan]
Molecular analysis: the chimpanzee is more closely related to the human than the gorilla is.
Pre-molecular analysis: the great apes (chimpanzee, gorilla and orangutan) are separate from the human.
What can we learn from a phylogenetic tree?
• Was the extinct quagga more like a zebra or a horse?
Determine the closest relatives of one organism in which we are interested
Which species are closest to humans?
[Figure: two alternative trees of Human, Chimpanzee, Gorilla and Orangutan]
Human Evolution
[Figure: Modern Man and Neanderthals]
Example: Metagenomics
A new field in genomics that aims to study the genomes recovered from environmental samples.
A powerful tool for accessing the rich biodiversity of native environmental samples.
It helps to find the relationships between species and to identify new species.
10^6 cells/ml seawater
10^7 virus particles/ml seawater
>99% uncultivated microbes
How can we discover new species in the ocean?
Relationships can be represented by a phylogenetic tree or dendrogram
[Figure: dendrogram with nodes labeled A, B, C, D, E and F]
Phylogenetic Tree Terminology
• Graph composed of nodes & branches
• Each branch connects two adjacent nodes
[Figure: tree with nodes labeled A, B, C, D, E, F and R]
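The terminology above (nodes, and branches that each connect two adjacent nodes) can be made concrete with a small sketch. The topology below is hypothetical: the slide's figure did not survive extraction, so the way A to F and R are connected here is an assumption chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical branches (an assumption, not the slide's actual figure):
# each branch connects exactly two adjacent nodes.
branches = [("R", "E"), ("R", "F"), ("E", "A"), ("E", "B"), ("F", "C"), ("F", "D")]


def build_adjacency(edges):
    """Map every node to the set of nodes it shares a branch with."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


def leaves(adjacency):
    """Leaves are the nodes touched by exactly one branch."""
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


adjacency = build_adjacency(branches)
print(leaves(adjacency))  # ['A', 'B', 'C', 'D']
```

A tree with k nodes always has k - 1 branches, which is a quick sanity check on any branch list.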
Rooted tree
Un-rooted tree
[Figure: a rooted and an un-rooted tree of Human, Chimp, Chicken and Gorilla]
Rooted vs. unrooted trees
[Figure: a rooted and an unrooted tree of three taxa labeled 1, 2 and 3]
How can we build a tree with molecular data?
- Trees based on DNA sequence (rRNA)
- Trees based on protein sequences
Basic algorithm for constructing a rooted tree
Unweighted Pair Group Method using Arithmetic Averages (UPGMA)
Assumption: Divergence of sequences is assumed to occur at a constant rate. Distance to root is equal.
Sequence a ACGCGTTGGGCGATGGCAAC
Sequence b ACGCGTTGGGCGACGGTAAT
Sequence c ACGCATTGAATGATGATAAT
Sequence d ACACATTGAGTGTGATAATA
[Figure: tree with leaves a, b, c and d]
Moving from Similarity to Distance

Sequences
Sequence a ACGCGTTGGGCGATGGCAAC
Sequence b ACACATTGAGTGTGATCAAC
Sequence c ACACATTGAGTGAGGACAAC
Sequence d ACGCGTTGGGCGACGGTAAT

Distances *
Dab = 8, Dac = 7, Dad = 5, Dbc = 3, Dbd = 9, Dcd = 8

Distance Table
  a b c d
a 0 8 7 5
b 8 0 3 9
c 7 3 0 8
d 5 9 8 0

* Can be calculated using different distance metrics
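As one example of a distance metric, the sketch below counts mismatched positions between aligned sequences (the Hamming distance) and fills a symmetric table. It is only an illustration: the slide does not say which metric produced its table, so the printed values need not be reproducible this way. The toy sequences and the function names are assumptions.

```python
from itertools import combinations


def hamming_distance(s1, s2):
    """Number of aligned positions at which two sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(s1, s2) if x != y)


def distance_table(seqs):
    """Build a symmetric table {name: {name: distance}} for all pairs."""
    table = {name: {name: 0} for name in seqs}
    for p, q in combinations(seqs, 2):
        d = hamming_distance(seqs[p], seqs[q])
        table[p][q] = d
        table[q][p] = d
    return table


# Toy input, invented for this example.
toy = {"x": "ACGT", "y": "ACGA", "z": "TCGA"}
table = distance_table(toy)
print(table["x"]["y"], table["x"]["z"], table["y"]["z"])  # 1 2 1
```

Other metrics (for example, distances corrected for multiple substitutions) would slot into the same table structure by swapping out `hamming_distance`.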
Constructing a tree starting from a STAR model
[Figure: star tree joining a, d, c and b]
Step 1: Choose the nodes with the shortest distance and fuse them.
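Step 1 can be sketched as a search for the smallest off-diagonal entry of the distance table. The table is the one shown on the slide; the function name `closest_pair` is an assumption introduced for this example.

```python
# Distance table copied from the slide.
D = {
    "a": {"a": 0, "b": 8, "c": 7, "d": 5},
    "b": {"a": 8, "b": 0, "c": 3, "d": 9},
    "c": {"a": 7, "b": 3, "c": 0, "d": 8},
    "d": {"a": 5, "b": 9, "c": 8, "d": 0},
}


def closest_pair(dist):
    """Return the pair of distinct nodes with the smallest distance."""
    names = sorted(dist)
    best = None
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if best is None or dist[p][q] < dist[best[0]][best[1]]:
                best = (p, q)
    return best


print(closest_pair(D))  # ('b', 'c'), since Dbc = 3 is the smallest entry
```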
Step 2: Recalculate the distances from the remaining sequences (a and d) to the new node (e), and remove the fused nodes (c and b) from the table.
D(ea) = (D(ac) + D(ab) - D(cb))/2 = (7 + 8 - 3)/2 = 6
D(ed) = (D(dc) + D(db) - D(cb))/2 = (8 + 9 - 3)/2 = 7

  a d e
a 0 5 6
d 5 0 7
e 6 7 0
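The Step 2 update can be sketched as a function that replaces the two fused nodes with a new one. It uses the update rule printed on the slide, D(e,x) = (D(x,c) + D(x,b) - D(c,b))/2; note that this is not the plain arithmetic average that the name UPGMA suggests. The function name `fuse` is an assumption.

```python
# Distance table copied from the slide.
D = {
    "a": {"a": 0, "b": 8, "c": 7, "d": 5},
    "b": {"a": 8, "b": 0, "c": 3, "d": 9},
    "c": {"a": 7, "b": 3, "c": 0, "d": 8},
    "d": {"a": 5, "b": 9, "c": 8, "d": 0},
}


def fuse(dist, p, q, new):
    """Return a new table in which nodes p and q are replaced by `new`."""
    others = [n for n in dist if n not in (p, q)]
    # Keep the distances among the untouched nodes.
    out = {n: {m: dist[n][m] for m in others} for n in others}
    out[new] = {new: 0}
    for n in others:
        # Update rule as printed on the slide.
        d = (dist[n][p] + dist[n][q] - dist[p][q]) / 2
        out[n][new] = d
        out[new][n] = d
    return out


reduced = fuse(D, "b", "c", "e")
print(reduced["e"]["a"], reduced["e"]["d"], reduced["a"]["d"])  # 6.0 7.0 5
```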
Step 3: In order to get a tree, un-fuse c and b by calculating their distances to the new node (e).
Note: the distances Dce and Dbe are calculated assuming a constant rate of evolution.
[Figure: nodes a, d and e, with c and b joined to e by branches labeled with their distances]
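The slide does not print the resulting branch lengths. One reading of the constant-rate assumption, offered here only as a hedged worked example, is that the new node e lies midway between c and b, so the distance Dcb = 3 from the table is split equally:

```latex
D_{ce} = D_{be} = \frac{D_{cb}}{2} = \frac{3}{2} = 1.5
```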
Next…
We want to fuse the next closest nodes: a and d (distance 5) are fused into a new node (f).
[Figure: a and d joined to the new node f; c and b joined to e]
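Finally
We need to calculate the distance between e and f:
D(ef) = (D(ea) + D(ed) - D(ad))/2 = (6 + 7 - 5)/2 = 4

  f e
f 0 4
e 4 0
[Figure: final tree in which a and d join node f, c and b join node e, and e is joined to f; branches labeled Daf, Ddf, Dce and Dbe]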
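The whole walk-through (pick the closest pair, fuse it, update the table, repeat) can be sketched as a loop. This follows the update rule printed on the slides rather than a library implementation; `closest_pair`, `fuse` and `build_tree` are names introduced for this example, and the new-node labels e, f, ... simply mirror the slides.

```python
# Distance table copied from the slides.
D = {
    "a": {"a": 0, "b": 8, "c": 7, "d": 5},
    "b": {"a": 8, "b": 0, "c": 3, "d": 9},
    "c": {"a": 7, "b": 3, "c": 0, "d": 8},
    "d": {"a": 5, "b": 9, "c": 8, "d": 0},
}


def closest_pair(dist):
    """Return the pair of distinct nodes with the smallest distance."""
    names = sorted(dist)
    best = None
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if best is None or dist[p][q] < dist[best[0]][best[1]]:
                best = (p, q)
    return best


def fuse(dist, p, q, new):
    """Replace p and q by `new`, using the slides' update rule."""
    others = [n for n in dist if n not in (p, q)]
    out = {n: {m: dist[n][m] for m in others} for n in others}
    out[new] = {new: 0}
    for n in others:
        d = (dist[n][p] + dist[n][q] - dist[p][q]) / 2
        out[n][new] = d
        out[new][n] = d
    return out


def build_tree(dist):
    """Fuse closest pairs until two nodes remain; return (tree, last distance)."""
    subtree = {n: n for n in dist}      # leaf name -> nested-tuple subtree
    new_names = iter("efghijklmnop")    # assumes leaves do not use these labels
    while len(dist) > 2:
        p, q = closest_pair(dist)
        new = next(new_names)
        subtree[new] = (subtree[p], subtree[q])
        dist = fuse(dist, p, q, new)
    p, q = list(dist)
    return (subtree[p], subtree[q]), dist[p][q]


tree, d_ef = build_tree(D)
print(tree, d_ef)  # (('b', 'c'), ('a', 'd')) 4.0
```

The nested tuples record only the branching order; branch lengths would have to be stored separately at each fusion.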
From a Star to a tree
[Figure: the star of a, d, c and b next to the resolved tree with internal nodes e and f]
IMPORTANT !!!
• Usually we don’t assume a constant mutation rate, and in order to choose the nodes to fuse we have to calculate the relative distance of each node to all other nodes.
Neighbor Joining (NJ) is an algorithm that is suitable for cases in which the rate of evolution varies.
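The "relative distance of each node to all other nodes" can be illustrated with the selection criterion commonly used in Neighbor Joining, Q(i,j) = (n - 2)·d(i,j) - Σ d(i,k) - Σ d(j,k); the pair with the lowest Q is fused. The slide does not give a formula, so treating Q as what it means is an assumption. The table is the one from the earlier slides, and `q_matrix` is a name introduced here.

```python
# Distance table copied from the earlier slides.
D = {
    "a": {"a": 0, "b": 8, "c": 7, "d": 5},
    "b": {"a": 8, "b": 0, "c": 3, "d": 9},
    "c": {"a": 7, "b": 3, "c": 0, "d": 8},
    "d": {"a": 5, "b": 9, "c": 8, "d": 0},
}


def q_matrix(dist):
    """Neighbor-Joining criterion for every pair of distinct nodes."""
    n = len(dist)
    # Total distance from each node to all other nodes.
    totals = {p: sum(dist[p].values()) for p in dist}
    names = sorted(dist)
    q = {}
    for i, p in enumerate(names):
        for r in names[i + 1:]:
            q[(p, r)] = (n - 2) * dist[p][r] - totals[p] - totals[r]
    return q


Q = q_matrix(D)
lowest = min(Q.values())
print(sorted(pair for pair, value in Q.items() if value == lowest))
# [('a', 'd'), ('b', 'c')]
```

On this table the pairs (a, d) and (b, c) tie at Q = -32. With four sequences these two pairs describe the same unrooted split, so either choice leads to the same topology.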
Human Evolution Tree
[Figure: the tree built with Neighbor Joining and with UPGMA]
The downside of phylogenetic trees
- Using different regions of the same alignment may produce different trees.
Problems with phylogenetic trees
[Figure: several alternative trees of the same seven taxa (Bacillus, E.coli, Pseudomonas, Salmonella, Aeromonas, Lechevaliera, Burkholderias), tips numbered 1 to 7 in differing orders, scale bar 0.2]
• What to do?
Bootstrapping
A. We create new data sets by sampling N positions with replacement.
B. We generate 100 - 1000 such pseudo-data sets.
C. For each such data set we reconstruct a tree, using the same method.
D. We note the agreement between the tree reconstructed from the pseudo-data set and the original tree.
Note: we do not change the number of sequences!
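Steps A and B can be sketched as resampling alignment columns with replacement, so that every pseudo-data set keeps all sequences and the original length N. The alignment is the four-sequence example from the UPGMA slide; the function names, the fixed seed and the choice of 100 replicates are assumptions. Steps C and D (tree building and comparison) are not shown.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Sample N alignment columns with replacement (N = alignment length)."""
    length = len(next(iter(seqs.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # The same sampled columns are applied to every sequence, so columns stay intact.
    return {name: "".join(seq[i] for i in columns) for name, seq in seqs.items()}


def bootstrap_replicates(seqs, n_replicates, seed=0):
    """Generate n_replicates pseudo-data sets from one alignment."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [bootstrap_alignment(seqs, rng) for _ in range(n_replicates)]


alignment = {
    "a": "ACGCGTTGGGCGATGGCAAC",
    "b": "ACGCGTTGGGCGACGGTAAT",
    "c": "ACGCATTGAATGATGATAAT",
    "d": "ACACATTGAGTGTGATAATA",
}
replicates = bootstrap_replicates(alignment, 100)
print(len(replicates), len(replicates[0]), len(replicates[0]["a"]))  # 100 4 20
```

Each replicate would then be passed to the same tree-building method, and the fraction of replicates that recover a given branch is the bootstrap value reported for it.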
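Bootstrapped tree
[Figure: tree of Pseudomonas, Burkholderias, E.coli, Salmonella, Lechevaliera, Aeromonas and Bacillus (tips numbered 1 to 7, scale bar 0.2) with bootstrap values 77, 100, 83 and 58 on its branches; annotations: "Less reliable branch" and "Highly reliable branch"]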
Open Questions
• Do DNA and protein sequences from the same gene produce different trees?
• Can different genes have different evolutionary histories?