"Phylogeny-driven studies in genomics and metagenomics" talk by Jonathan Eisen at #CSMUBC2012
Jonathan Eisen talk "Phylogenomic approaches to functional prediction" at #AFP2012 #ISMB
Phylogenomic Approaches to Functional Prediction
Automated Function Prediction SIG, ISMB 2012
July 13, 2012
Jonathan A. Eisen, University of California, Davis
@phylogenomics
Saturday, July 14, 12
PAFP
Acknowledgements
• $$$: DOE, NSF, GBMF, Sloan, DARPA, DSMZ, DHS
• People, places
  • DOE JGI: Eddy Rubin, Phil Hugenholtz, Nikos Kyrpides
  • UC Davis: Aaron Darling, Dongying Wu, Holly Bik, Russell Neches, Jenna Morgan-Lang
  • Other: Jessica Green, Katie Pollard, Martin Wu, Tom Slezak, Jack Gilbert, Steven Kembel, J. Craig Venter, Naomi Ward, Hans-Peter Klenk, Phil Hanawalt
Saturday, July 14, 12
Phylogenomics of Novelty
Mechanisms of Origin of New Functions
Species Evolution
Variation in Mechanisms:
Patterns, Causes and Effects
Saturday, July 14, 12
Origin of Novelty
• How does novelty originate?
• What are the constraints on evolvability?
• What leads to variation in evolvability within the genome, and within and between species?
• This information helps interpret the past, understand the present and (maybe) predict the future
Saturday, July 14, 12
History
Saturday, July 14, 12
Whatever the History: Trying to Incorporate it is Critical
from Lake et al. doi: 10.1098/rstb.2009.0035
Saturday, July 14, 12
PAFP AFP SIG ISMB 2012 I:
Predicting Functions with Evolutionary Trees
Saturday, July 14, 12
SNF2 Family of Proteins (1995)
• SNF2 family defined by presence of conserved DNA-dependent ATPase domain
• 100s of proteins
• Diversity of functions:
  • transcriptional activation (SNF2)
  • transcriptional repression (MOT1)
  • recombination (RAD54)
  • transcription-coupled repair (CSB)
  • post-replication repair (RAD5)
  • chromosome segregation (lodestar)
  • many with unknown functions
• Some species have 15+ representatives
Bork and Koonin 1993
Saturday, July 14, 12
SNF2 Alignment
[Figure: alignment of SNF2 family proteins (BRM, hBRM, hBRG1, mBRG1, STH1, SNF2, YB95, F37A4, ISWI, SNF2L, CHD1, SYGP, ETL1, FUN30, MOT1, ERCC6, RAD26, YB53, DNRPPX, hNUCP, mNUCP, RAD5, spRAD8, HIP116, RAD16, LODE, NPH42, HepA, B. cereus ORF), grouped by subfamily (SNF2, SNF2L, CHD1, ETL1, MOT1, ERCC6, RAD54, RAD16), with helicase motifs I, Ia, Ib, II, III, IV, V and VI marked; scale bar 0 to 500 aa]
Saturday, July 14, 12
SNF2 Subfamilies
Subfamily   Function
SNF2        Transcription activation (Swi/Snf complex)
SNF2L       Transcription activation (NURF complex)
CHD1        Chromatin remodelling
ETL1        Unknown
MOT1        Transcription repression
CSB         Transcription-coupled repair
Rad54       Recombinational repair
Rad16       Chromatin access for DNA repair
HepA        Bacterial RNA polymerase subunit
Saturday, July 14, 12
SNF2 Tree and F(x) Prediction
• Function conserved within but not between subfamilies/orthology groups
• Therefore, assignment of genes to subfamilies can be used to predict functions of unknowns
• Grouping into subfamilies helps identify motifs conserved within groups
• Phylogeny recovers subfamilies better than similarity searches
Saturday, July 14, 12
From Eisen et al. 1997 Nature Medicine 3: 1076-1078.
Saturday, July 14, 12
Blast Search of H. pylori “MutS”
• Blast search pulls up Syn. sp MutS#2 with much higher p value than other MutS homologs
• Based on this TIGR predicted this species had mismatch repair
• Assumes functional constancy
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
Saturday, July 14, 12
MutL??
From http://asajj.roswellpark.org/huberman/dna_repair/mmr.html
Saturday, July 14, 12
Phylogenetic Tree of MutS Family
[Figure: phylogenetic tree of MutS family proteins; tip labels are species abbreviations: Aquae, Trepa, Fly, Xenla, Rat, Mouse, Human, Yeast, Neucr, Arath, Borbu, Strpy, Bacsu, Synsp, Ecoli, Neigo, Thema, Theaq, Deira, Chltr, Spombe, Celeg, Metth, Helpy, mSaco]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Saturday, July 14, 12
MutS Subfamilies
[Figure: the MutS family tree with clades labelled by subfamily: MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6, MSH2]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Saturday, July 14, 12
Overlaying Functions onto Tree
[Figure: the MutS family tree with known functions overlaid on the tips; clades labelled MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6, MSH2]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Saturday, July 14, 12
MutS Subfamilies
• MutS1 Bacterial MMR
• MSH1 Euk - mitochondrial MMR
• MSH2 Euk - all MMR in nucleus
• MSH3 Euk - loop MMR in nucleus
• MSH6 Euk - base:base MMR in nucleus
• MutS2 Bacterial - function unknown
• MSH4 Euk - meiotic crossing-over
• MSH5 Euk - meiotic crossing-over
Saturday, July 14, 12
Functional Prediction Using Tree
[Figure: the MutS family tree with a predicted function for each subfamily clade:]
• MSH1 - Mitochondrial Repair
• MSH3 - Nuclear Repair of Loops
• MSH6 - Nuclear Repair of Mismatches
• MutS1 - Bacterial Mismatch and Loop Repair
• MSH4 - Meiotic Crossing Over
• MSH5 - Meiotic Crossing Over
• MutS2 - Unknown Functions
• MSH2 - Eukaryotic Nuclear Mismatch and Loop Repair
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Saturday, July 14, 12
Ancient MutS Duplication
Saturday, July 14, 12
Table 3. Presence of MutS Homologs in Complete Genome Sequences
Species                                    # of MutS Homologs   Which Subfamilies?   MutL Homologs
Bacteria
Escherichia coli K12                       1                    MutS1                1
Haemophilus influenzae Rd KW20             1                    MutS1                1
Neisseria gonorrhoeae                      1                    MutS1                1
Helicobacter pylori 26695                  1                    MutS2                -
Mycoplasma genitalium G-37                 -                    -                    -
Mycoplasma pneumoniae M129                 -                    -                    -
Bacillus subtilis 168                      2                    MutS1, MutS2         1
Streptococcus pyogenes                     2                    MutS1, MutS2         1
Mycobacterium tuberculosis                 -                    -                    -
Synechocystis sp. PCC6803                  2                    MutS1, MutS2         1
Treponema pallidum Nichols                 1                    MutS1                1
Borrelia burgdorferi B31                   2                    MutS1, MutS2         1
Aquifex aeolicus                           2                    MutS1, MutS2         1
Deinococcus radiodurans R1                 2                    MutS1, MutS2         1
Archaea
Archaeoglobus fulgidus VC-16, DSM4304      -                    -                    -
Methanococcus jannaschii DSM 2661          -                    -                    -
Methanobacterium thermoautotrophicum ΔH    1                    MutS2                -
Eukaryotes
Saccharomyces cerevisiae                   6                    MSH1-6               3+
Homo sapiens                               5                    MSH2-6               3+
MutS1,2 vs MutL
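The "MutS1,2 vs MutL" comparison can be checked directly against the bacterial and archaeal rows of Table 3. The sketch below is only an illustration: the rows are transcribed from the table, while the function name `tally_mutl_by_muts1` and the tuple layout are assumptions made for this example.

```python
# Rows transcribed from Table 3 (bacteria and archaea only):
# (species, MutS subfamilies found, MutL homolog present?)
TABLE3_PROKARYOTES = [
    ("Escherichia coli", {"MutS1"}, True),
    ("Haemophilus influenzae", {"MutS1"}, True),
    ("Neisseria gonorrhoeae", {"MutS1"}, True),
    ("Helicobacter pylori", {"MutS2"}, False),
    ("Mycoplasma genitalium", set(), False),
    ("Mycoplasma pneumoniae", set(), False),
    ("Bacillus subtilis", {"MutS1", "MutS2"}, True),
    ("Streptococcus pyogenes", {"MutS1", "MutS2"}, True),
    ("Mycobacterium tuberculosis", set(), False),
    ("Synechocystis sp.", {"MutS1", "MutS2"}, True),
    ("Treponema pallidum", {"MutS1"}, True),
    ("Borrelia burgdorferi", {"MutS1", "MutS2"}, True),
    ("Aquifex aeolicus", {"MutS1", "MutS2"}, True),
    ("Deinococcus radiodurans", {"MutS1", "MutS2"}, True),
    ("Archaeoglobus fulgidus", set(), False),
    ("Methanococcus jannaschii", set(), False),
    ("Methanobacterium thermoautotrophicum", {"MutS2"}, False),
]


def tally_mutl_by_muts1(rows):
    """Count genomes by (has MutS1, has MutL)."""
    counts = {}
    for _species, subfamilies, has_mutl in rows:
        key = ("MutS1" in subfamilies, has_mutl)
        counts[key] = counts.get(key, 0) + 1
    return counts


TALLY = tally_mutl_by_muts1(TABLE3_PROKARYOTES)
print(TALLY)
```

In these rows, every genome with a MutS1 homolog also has a MutL homolog, and every genome without MutS1 (including those with only MutS2) lacks MutL.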
Saturday, July 14, 12
PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: flowchart of the METHOD, worked through for EXAMPLE A and EXAMPLE B with genes from Species 1, Species 2 and Species 3, shown next to the ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN). Steps:]
• CHOOSE GENE(S) OF INTEREST
• IDENTIFY HOMOLOGS
• ALIGN SEQUENCES
• CALCULATE GENE TREE
• OVERLAY KNOWN FUNCTIONS ONTO TREE
• INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
[Inferred duplications are marked "Duplication?"; one prediction is marked "Ambiguous".]
Based on Eisen, 1998 Genome Res 8: 163-167.
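The last two steps of the method (overlay known functions, then infer the function of the gene of interest) can be illustrated with a toy gene tree. This is a deliberately simplified sketch, not the procedure from the paper: it reports the functions found in the smallest clade that contains the query and at least one annotated gene, and treats more than one function as ambiguous. The nested-tuple tree encoding, the function names `leaves` and `predict_function`, and the labels "function X" and "function Y" are assumptions for this example; the gene labels follow the figure.

```python
def leaves(tree):
    """Return the tip labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def predict_function(tree, query, known):
    """Functions seen in the smallest clade holding `query` and an annotated gene.

    One function: a candidate prediction. Several: ambiguous. Empty set: no call.
    """
    if isinstance(tree, str):
        return set()
    for child in tree:
        if query in leaves(child):
            found = predict_function(child, query, known)
            if found:
                return found
            break
    else:
        return set()  # the query is not in this tree
    return {known[tip] for tip in leaves(tree) if tip in known and tip != query}


# Toy gene tree: an ancient duplication gives paralog groups A and B.
GENE_TREE = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
KNOWN = {"1A": "function X", "3A": "function X",
         "1B": "function Y", "3B": "function Y"}

print(predict_function(GENE_TREE, "2A", KNOWN))  # annotated A-group relatives only
print(predict_function(GENE_TREE, "2B", KNOWN))  # annotated B-group relatives only
```

The point of the sketch is that the call for 2A depends on where it sits in the tree, not on which annotated sequence happens to be the most similar.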
Saturday, July 14, 12
Evolutionary Rate Variation
[Figure: tree of genes 1 to 6 illustrating evolutionary rate variation]
Saturday, July 14, 12
Functional Diversity of Proteorhodopsins?
Venter et al., Science 304: 66. 2004
Saturday, July 14, 12
A single tree with everything?
Phylogenetic Challenge
Saturday, July 14, 12
PhyloSift / pplacer
Saturday, July 14, 12
rRNA Phylotyping
[Figure: workflow: DNA extraction → PCR (makes lots of copies of the rRNA genes in sample) → sequence rRNA genes → sequence alignment = data matrix → phylogenetic tree, with the sampled sequences rRNA1 to rRNA4 placed relative to E. coli, Humans and Yeast]
rRNA1 5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’
rRNA2 5’..TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’
rRNA3 5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’
rRNA4 5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’
Data matrix (aligned columns):
rRNA3 C A C T G T
rRNA4 C A C A G T
Yeast T A C A G T
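The "sequence alignment = data matrix" step can be made concrete with the three labelled rows from the slide (rRNA3, rRNA4, Yeast). The sketch below simply counts differing columns between each pair of rows; it is not a tree-building method, and the names `MATRIX`, `count_differences` and `pairwise_differences` are assumptions for this example.

```python
from itertools import combinations

# Aligned columns as given on the slide for the three labelled rows.
MATRIX = {
    "rRNA3": "CACTGT",
    "rRNA4": "CACAGT",
    "Yeast": "TACAGT",
}


def count_differences(a, b):
    """Number of alignment columns at which two equal-length rows differ."""
    if len(a) != len(b):
        raise ValueError("rows must come from the same alignment")
    return sum(x != y for x, y in zip(a, b))


def pairwise_differences(matrix):
    """Differences for every unordered pair of rows, keyed by sorted names."""
    return {
        (m, n): count_differences(matrix[m], matrix[n])
        for m, n in combinations(sorted(matrix), 2)
    }


DIFFS = pairwise_differences(MATRIX)
for pair, n in DIFFS.items():
    print(pair, n)
```

A distance table like this is one possible input to a distance-based tree method; character-based methods would use the columns of the matrix directly.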
Saturday, July 14, 12
Eisen et al. 2002
Eisen et al. 1992
Saturday, July 14, 12
PAFP AFP SIG ISMB 2012 II:
Every gene family is unique ...
Saturday, July 14, 12
Steps in Phylogenomics
• Create database of genes of interest
• Presence/absence of homologs in complete genomes
• Phylogenetic trees of each gene family
• Infer evolutionary events (gene origin, duplication, loss and transfer)
• Refine presence/absence (orthologs, paralogs, subfamilies)
• Functional predictions and functional evolution
• Analysis of pathways
Saturday, July 14, 12
Photoreactivation/Photolyases
• All photoreactivation is carried out by enzymes in the photolyase family
• Two main classes of photolyases – class I and class II – are distantly related to each other and likely the result of an ancient duplication
• PhrI and PhrII missing from most species for which complete genomes are available.
• Many cases of functional change (e.g., CPD -> 6-4) and some are not even involved in DNA repair
• Many of the eukaryotic proteins appear to be of an organellar ancestry
Saturday, July 14, 12
Photoreactivation
• All known enzymes that perform photoreactivation are part of a single large photolyase gene family
• Some members of the family do not function as photolyases, but instead work as blue-light receptors
• If a species does not encode a member of the photolyase gene family, it likely does not have photoreactivation capability
• If a species encodes a photolyase homolog, one cannot conclude it has photolyase activity
• Position of photolyase homologs within photolyase tree helps predict what activities they have
Saturday, July 14, 12
Alkyltransferases
• All known alkyltransferases are members of a single gene family
• Found in most but not all species
• Likely present in LUCA
• Ada protein in E. coli originated by fusion between an alkyltransferase and a transcription-regulatory domain
• Gram-positive bacteria have the Ada domain fused to an alkylation glycosylase instead of alkyltransferase
Saturday, July 14, 12
BER Glycosylases
• Distribution patterns highly uneven but some glycosylases have been found in all species
• Some are ancient enzymes, probably present in LUCA (e.g., MutY-Nth), others more recent (e.g., TagI).
• Many families are distantly related to each other (e.g., Ogg, AlkA, MutY-Nth)
• Many cases of gene duplication, loss and possibly transfer, especially from organellar genomes to nucleus
• Orthologs frequently have different specificity
Saturday, July 14, 12
AP Endonucleases
• All species encode either Nfo or Xth homologs. Some encode both.
• Only Nfo: mycoplasmas, Aquifex, M. jannaschii, yeast
• Only Xth: many bacteria, A. fulgidus, humans (so far)
• Both: E. coli, B. subtilis, M. tuberculosis, M. thermoautotrophicum
• Both Nfo and Xth are likely ancient.
• Many cases of gene loss of one or the other, but never both
Saturday, July 14, 12
Uracil Glycosylase
• Many non-homologous proteins have uracil-DNA glycosylase activity (Ung, GAPDH, MUG, cyclin)
• Therefore, absence of homologs of these genes should not be used to infer likely absence of activity
• However, presence of homologs of Ung and MUG genes can be used to indicate presence of activity because all homologs of these genes have this activity
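The presence/absence reasoning on the photoreactivation and uracil glycosylase slides can be written as a small decision rule. The sketch below is one possible encoding under stated assumptions: the class `FamilyRule`, its two flags and the function `infer_activity` are hypothetical names, and the flag values are read off the bullets on those slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyRule:
    activity: str
    # True if every known enzyme with this activity belongs to this family.
    only_known_source: bool
    # True if every known member of the family has this activity.
    all_members_active: bool


def infer_activity(rule, homolog_present):
    """What presence or absence of a homolog lets one say about the activity."""
    if homolog_present:
        return "likely present" if rule.all_members_active else "cannot conclude"
    return "likely absent" if rule.only_known_source else "cannot conclude"


# Photolyase family: sole known source, but some members are blue-light receptors.
PHOTOLYASE = FamilyRule("photoreactivation", only_known_source=True,
                        all_members_active=False)
# Ung/MUG: all homologs active, but non-homologous proteins share the activity.
UNG_MUG = FamilyRule("uracil-DNA glycosylase", only_known_source=False,
                     all_members_active=True)

for rule in (PHOTOLYASE, UNG_MUG):
    for present in (True, False):
        print(rule.activity, present, infer_activity(rule, present))
```

The two families give opposite answers for the same observation, which is the sense in which every gene family needs its own rule.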
Saturday, July 14, 12
Not Open Access
Saturday, July 14, 12
PAFP AFP SIG ISMB 2012 III:
When phylogeny is not enough ...
Saturday, July 14, 12
But ...
• Many powerful and automated similarity based methods for assigning genes to protein families
  • COGs
  • PFAM HMM searches
• Some limitations of similarity based methods can be overcome by phylogenetic approaches
• Automated methods now available
  • Sean Eddy
  • Steven Brenner
  • Kimmen Sjölander
• But …
Saturday, July 14, 12
Example: Recent Changes
[Figure: neighbor-joining (NJ) tree of homologs from E. coli, B. subtilis, Synechocystis sp., H. pylori, C. jejuni, A. fulgidus, T. maritima, T. pallidum, B. burgdorferi, P. abyssi, P. horikoshii, D. radiodurans, M. tuberculosis and V. cholerae; most of the V. cholerae genes are listed together in one large block, and branches are annotated with asterisks]
• Phylogenomic functional prediction may not work well for very newly evolved functions
• Can use understanding of origin of novelty to better interpret these cases?
• Screen genomes for genes that have changed recently
  – Pseudogenes and gene loss
  – Contingency Loci
  – Acquisition (e.g., LGT)
  – Unusual dS/dN ratios
  – Rapid evolutionary rates
  – Recent duplications
Saturday, July 14, 12
Non-Homology Predictions: Phylogenetic Profiling
• Step 1: Search all genes in organisms of interest against all other genomes
• Ask: Yes or No, is each gene found in each other species
• Cluster genes by distribution patterns (profiles)
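The profiling steps above can be sketched in a few lines. This is a minimal illustration that groups genes with identical presence/absence profiles; published methods typically also tolerate near-identical profiles. The function name `cluster_by_profile` and the toy gene and species names are hypothetical.

```python
from collections import defaultdict


def cluster_by_profile(presence, genomes):
    """Group genes that share an identical presence/absence profile.

    presence: maps each gene to the set of genomes with a detectable homolog.
    genomes:  ordered list of genomes; fixes the order of the yes/no answers.
    """
    clusters = defaultdict(list)
    for gene, found_in in presence.items():
        profile = tuple(genome in found_in for genome in genomes)
        clusters[profile].append(gene)
    return dict(clusters)


# Toy input: which of three genomes has a homolog of each gene.
GENOMES = ["sp1", "sp2", "sp3"]
PRESENCE = {
    "geneA": {"sp1", "sp2"},
    "geneB": {"sp1", "sp2"},
    "geneC": {"sp3"},
}

CLUSTERS = cluster_by_profile(PRESENCE, GENOMES)
for profile, genes in CLUSTERS.items():
    print(profile, genes)
```

Genes that fall in the same cluster (here geneA and geneB) are candidates for a shared pathway or process, without any sequence similarity between them.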
Pellegrini et al. 1999. PNAS 96: 4285.
Saturday, July 14, 12
Correlated gain/loss of genes
• Microbial genes are lost rapidly when not maintained by selection
• Genes can be acquired by lateral transfer
• Frequently gain and loss occurs for entire pathways/processes
• Thus might be able to use correlated presence/absence information to identify genes with similar functions
Saturday, July 14, 12
Carboxydothermus hydrogenoformans
• Isolated from a Russian hotspring
• Thermophile (grows at 80°C)
• Anaerobic
• Grows very efficiently on CO
• Produces hydrogen gas
• Low GC Gram + (Firmicute)
• Genome Determined
Wu et al. 2005 PLoS Genetics 1: e65.
Saturday, July 14, 12
Homologs of Sporulation Genes
Wu et al. 2005 PLoS Genetics 1: e65.
Saturday, July 14, 12
Carboxydothermus sporulates
Wu et al. 2005 PLoS Genetics 1: e65.
Saturday, July 14, 12
Wu et al. 2005 PLoS Genetics 1: e65.
Saturday, July 14, 12
PG Profiling Works Better with Families
Saturday, July 14, 12
PAFP AFP SIG ISMB 2012 IV:
Knowing What You Don’t Know
Saturday, July 14, 12
[Figure: tree of bacterial phyla: Acidobacteria, Bacteroides, Fibrobacteres, Gemmimonas, Verrucomicrobia, Planctomycetes, Chloroflexi, Proteobacteria, Chlorobi, Firmicutes, Fusobacteria, Actinobacteria, Cyanobacteria, Chlamydia, Spirochaetes, Deinococcus-Thermus, Aquificae, Thermotogae, TM6, OS-K, Termite Group, OP8, Marine Group A, WS3, OP9, NKB19, OP3, OP10, TM7, OP1, OP11, Nitrospira, Synergistes, Deferribacteres, Thermodesulfobacteria, Chrysiogenetes, Thermomicrobia, Dictyoglomus, Coprothermobacter]
• At least 40 phyla of bacteria
As of 2002
Based on Hugenholtz, 2002
Saturday, July 14, 12
[Figure: same tree of bacterial phyla as shown earlier]
• At least 40 phyla of bacteria
• Most genomes from three phyla
As of 2002
Based on Hugenholtz, 2002
Saturday, July 14, 12
[Figure: same tree of bacterial phyla as shown earlier]
• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some studies in other phyla
As of 2002
Based on Hugenholtz, 2002
Saturday, July 14, 12
[Figure: same tree of bacterial phyla as shown earlier]
• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Eukaryotes
As of 2002
Based on Hugenholtz, 2002
Saturday, July 14, 12
[Figure: same tree of bacterial phyla as shown earlier]
• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Viruses
As of 2002
Based on Hugenholtz, 2002
Saturday, July 14, 12
TIGR TOL 2002
Saturday, July 14, 12
GEBA
Saturday, July 14, 12
GEBA Lesson 1: Improves genome annotation
• Took 56 GEBA genomes and compared results vs. 56 randomly sampled new genomes
• Better definition of protein family sequence “patterns”
• Greatly improves “comparative” and “evolutionary” based predictions
• Conversion of hypothetical into conserved hypotheticals
• Linking distantly related members of protein families
• Improved non-homology prediction
Saturday, July 14, 12
GEBA Lesson 2: Metadata Important
Saturday, July 14, 12
GEBA Lesson 3: Improves discovering new genetic diversity
Saturday, July 14, 12
Protein Family Rarefaction
• Take data set of multiple complete genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
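The rarefaction steps above can be sketched as follows, assuming the family assignment (for example from MCL) has already been done and is available as a mapping from genome to the set of family identifiers it contains. The function name `rarefaction_curve`, its parameters and the toy data are hypothetical; the sketch averages the cumulative family count over random genome orderings to get the values to plot.

```python
import random


def rarefaction_curve(genome_families, permutations=100, seed=0):
    """Mean number of distinct protein families after sampling 1..N genomes.

    genome_families: maps each genome to the set of family ids found in it.
    Returns a list whose i-th entry is the mean count for i + 1 genomes.
    """
    rng = random.Random(seed)  # fixed seed so the curve is reproducible
    genomes = list(genome_families)
    totals = [0] * len(genomes)
    for _ in range(permutations):
        rng.shuffle(genomes)
        seen = set()
        for i, genome in enumerate(genomes):
            seen |= genome_families[genome]
            totals[i] += len(seen)
    return [total / permutations for total in totals]


# Toy input: three genomes and the protein families each contains.
TOY = {"g1": {"a", "b"}, "g2": {"b", "c"}, "g3": {"c"}}
CURVE = rarefaction_curve(TOY, permutations=50, seed=1)
print(CURVE)  # y values to plot against 1, 2, 3 genomes
```

A curve that keeps rising as genomes are added indicates that new protein families are still being discovered; comparing curves for different genome sets shows which set adds families faster.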
Saturday, July 14, 12
Wu et al. 2009 Nature 462, 1056-1060
Saturday, July 14, 12
Synapomorphies exist
Wu et al. 2009 Nature 462, 1056-1060
Saturday, July 14, 12
Families/PD not uniform
Saturday, July 14, 12
Structural Novelty
• Of the 17000 protein families in the GEBA56, 1800 are novel in sequence (Wu)
• Structural modeling suggests many are structurally novel too (D'haeseleer)
• 372 being crystallized by the PSI (Kerfeld)
Saturday, July 14, 12
Needed Reference Tree
Saturday, July 14, 12
GEBA Lesson 4: Much diversity untouched
Saturday, July 14, 12
rRNA Tree of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
Saturday, July 14, 12
Phylogenetic Diversity:
From Wu et al. 2009 Nature 462, 1056-1060
Saturday, July 14, 12
Phylogenetic Diversity with
From Wu et al. 2009 Nature 462, 1056-1060
Saturday, July 14, 12
Phylogenetic Diversity: Isolates
From Wu et al. 2009 Nature 462, 1056-1060
Saturday, July 14, 12
Haloarchaeal GEBA-like
Saturday, July 14, 12
Phylogenetic Diversity: All
From Wu et al. 2009 Nature 462, 1056-1060
Saturday, July 14, 12
Uncultured Lineages:
• Get into culture
• Enrichment cultures
• If abundant in low diversity ecosystems
• Flow sorting
• Microbeads
• Microfluidic sorting
• Single cell amplification
Saturday, July 14, 12
GEBA uncultured
Number of SAGs from Candidate Phyla
Site                                    OD1   OP11   OP3   SAR406
Site A: Hydrothermal vent               4     1      -     -
Site B: Gold Mine                       6     13     2     -
Site C: Tropical gyres (Mesopelagic)    -     -      -     2
Site D: Tropical gyres (Photic zone)    1     -      -     -
Sample collections at 4 additional sites are underway.
Phil Hugenholtz
Saturday, July 14, 12
RecA, RpoB in GOS
[Figure: groups labelled GOS 1 to GOS 5]
Wu et al PLoS One 2011
Saturday, July 14, 12
GEBA Lesson 6: Experimental diversity
Saturday, July 14, 12
[Figure: same tree of bacterial phyla as shown earlier]
• At least 40 phyla of bacteria
As of 2002
Based on Hugenholtz, 2002
Saturday, July 14, 12
[Figure: same tree of bacterial phyla as shown earlier]
• At least 40 phyla of bacteria
• Experimental studies are mostly from three phyla
As of 2002
Based on Hugenholtz, 2002
Saturday, July 14, 12
[Figure: same tree of bacterial phyla as shown earlier]
• At least 40 phyla of bacteria
• Experimental studies are mostly from three phyla
• Some studies in other phyla
As of 2002
Based on Hugenholtz, 2002
Saturday, July 14, 12
[Figure: same tree of bacterial phyla as shown earlier]
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Eukaryotes
As of 2002
Based on Hugenholtz, 2002
Saturday, July 14, 12
[Figure: same tree of bacterial phyla as shown earlier]
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Viruses
As of 2002
Based on Hugenholtz, 2002
Saturday, July 14, 12
[Figure: same tree of bacterial phyla as shown earlier; scale bar 0.1]
Need experimental studies from across the tree too
Based on Hugenholtz, 2002
Saturday, July 14, 12
[Figure: same tree of bacterial phyla as shown earlier; scale bar 0.1]
Adopt a Microbe
Based on Hugenholtz, 2002
Saturday, July 14, 12
Acknowledgements
• $$$: DOE, NSF, GBMF, Sloan, DARPA, DSMZ, DHS
• People, places
  • DOE JGI: Eddy Rubin, Phil Hugenholtz, Nikos Kyrpides
  • UC Davis: Aaron Darling, Dongying Wu, Holly Bik, Russell Neches, Jenna Morgan-Lang
  • Other: Jessica Green, Katie Pollard, Martin Wu, Tom Slezak, Jack Gilbert, Steven Kembel, J. Craig Venter, Naomi Ward, Hans-Peter Klenk
Saturday, July 14, 12