Computational Methodology for Microbial and Metagenomic Characterization using Large Scale Functional Genomic Data Integration
Curtis Huttenhower
03-08-10
Harvard School of Public Health
Department of Biostatistics
Outline
1. Network models of functional data
2. Network models of microbes
3. Network models of microbiomes
Meta-analysis for unsupervised functional data integration
Following up with round-robin and semi-supervised evaluations
Huttenhower 2006; Hibbs 2007
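[Equation: Fisher z-transform of each correlation, $z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$, followed by standardization of the z-scores]
[Figure: normalized datasets combined (+) into one integrated network (=)]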
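A minimal sketch of this kind of unsupervised integration, assuming each dataset's correlations are Fisher z-transformed, standardized within the dataset, and then averaged per gene pair. The function names and toy data are hypothetical, and this is not necessarily the exact procedure of Huttenhower 2006.

```python
import math
from statistics import mean, pstdev


def fisher_z(rho, eps=1e-6):
    """Fisher z-transform of a correlation, clipped to avoid infinities at +/-1."""
    rho = max(-1 + eps, min(1 - eps, rho))
    return 0.5 * math.log((1 + rho) / (1 - rho))


def normalize_dataset(correlations):
    """Map {gene_pair: correlation} to z-scores standardized within one dataset."""
    z = {pair: fisher_z(r) for pair, r in correlations.items()}
    mu, sigma = mean(z.values()), pstdev(z.values())
    return {pair: (value - mu) / sigma for pair, value in z.items()}


def integrate(datasets):
    """Average standardized scores over the datasets that measure each pair."""
    totals, counts = {}, {}
    for correlations in datasets:
        for pair, score in normalize_dataset(correlations).items():
            totals[pair] = totals.get(pair, 0.0) + score
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Toy input (assumption): two datasets measuring the same three gene pairs.
d1 = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): -0.2}
d2 = {("a", "b"): 0.7, ("a", "c"): 0.0, ("b", "c"): 0.1}
network = integrate([d1, d2])
```

Standardizing before averaging keeps a dataset with unusually high or low overall correlation from dominating the integrated network.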
Functional network prediction from diverse microbial data
486 bacterial expression experiments
876 raw datasets
310 postprocessed datasets
304 normalized coexpression networks in 27 species
Integrated functional interaction networks in 15 species
307 bacterial interaction experiments
154796 raw interactions
114786 postprocessed interactions
Functional maps for cross-species knowledge transfer
Following up with unsupervised and partially anchored network alignment
Huttenhower 2008; Huttenhower 2009
Functional maps for functional metagenomics
Mapping genes into pathways
Mapping pathways into organisms
+ Integrated functional interaction networks in 27 species
Mapping organisms into phyla
=
GOS 4441599.3, Hypersaline Lagoon, Ecuador
Functional maps for functional metagenomics
Nodes: Process cohesiveness in obesity (Very Downregulated / Baseline (no change) / Very Upregulated)
Edges: Process association in obesity (More Coregulated / Less Coregulated / Baseline (no change))
Summarizes information from ~10M metagenomic reads and ~500 genome-scale microbial experiments.
Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
– Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
• Network communication, parallelization
• Efficient machine learning algorithms
– Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: microbial data integration computation takes <3hrs.
Thanks!
NIGMS
http://function.princeton.edu/hefalmp
http://huttenhower.sph.harvard.edu/sleipnir
Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Chris Park, Ana Pop, Aaron Wong
Hilary Coller, Erin Haley
Jacques Izard
Wendy Garrett
Sarah Fortune, Tracy Rosebrock
Functional mapping: Functional associations between processes
Edges: Associations between processes (Very Strong / Moderately Strong)
Nodes: Cohesiveness of processes (Below Baseline / Baseline (genomic background) / Very Cohesive)
Borders: Data coverage of processes (Well Covered / Sparsely Covered)
Information mapped from ~100 E. coli experiments
Meta-analysis for unsupervised functional data integration
Following up with round-robin and semi-supervised evaluations
Evangelou 2007
Huttenhower 2006; Hibbs 2007
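[Equation: Fisher z-transform of each correlation, $z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$, followed by standardization of the z-scores]
[Figure: normalized datasets combined (+) into one integrated network (=)]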
Functional mapping: mining integrated networks
Predicted relationships between genes (High Confidence / Low Confidence)
The strength of these relationships indicates how cohesive a process is.
Chemotaxis
Functional mapping: mining integrated networks
Flagellar assembly
The strength of these relationships indicates how associated two processes are.
Predicted relationships between genes (High Confidence / Low Confidence)
Chemotaxis
Functional maps for cross-species knowledge transfer
[Figure: network of genes G1 to G17 and groups O1 to O9]
O1: G1, G2, G3
O2: G4
O3: G6
…
ECG1, ECG2
BSG1
ECG3, BSG2
…
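One way to picture cross-species transfer is to lift gene-level edges from a well-studied organism into a shared space of gene groups and then push them back down onto another organism's genes. The sketch below assumes the O labels denote groups of orthologous genes and that the EC and BS prefixes mark genes from two different organisms. The cluster contents, edge weights, and function names are hypothetical and are not taken from the slide.

```python
def project_edges(edges, clusters):
    """Lift gene-gene edge weights into cluster space, averaging per cluster pair."""
    lookup = {gene: cid for cid, genes in clusters.items() for gene in genes}
    sums, counts = {}, {}
    for (g1, g2), weight in edges.items():
        c1, c2 = lookup.get(g1), lookup.get(g2)
        if c1 is None or c2 is None or c1 == c2:
            continue  # skip unmapped genes and edges inside one cluster
        key = tuple(sorted((c1, c2)))
        sums[key] = sums.get(key, 0.0) + weight
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def transfer(cluster_edges, clusters, prefix):
    """Predict edges among one organism's genes from cluster-level weights."""
    predicted = {}
    for (c1, c2), weight in cluster_edges.items():
        for g1 in clusters[c1]:
            for g2 in clusters[c2]:
                if g1.startswith(prefix) and g2.startswith(prefix):
                    predicted[tuple(sorted((g1, g2)))] = weight
    return predicted


# Hypothetical clusters and edges for illustration only.
clusters = {
    "O1": ["ECG1", "ECG2", "BSG1"],
    "O2": ["ECG3", "BSG2"],
    "O3": ["ECG4"],
}
ec_edges = {("ECG1", "ECG3"): 1.0, ("ECG2", "ECG3"): 0.5, ("ECG1", "ECG4"): 0.25}

cluster_edges = project_edges(ec_edges, clusters)
bs_predicted = transfer(cluster_edges, clusters, "BS")
```

Averaging within a cluster pair is only one possible choice; a maximum or a confidence-weighted mean would fit the same structure.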
Functional network prediction from diverse microbial data
486 bacterial expression experiments
876 raw datasets
310 postprocessed datasets
304 normalized coexpression networks in 27 species
Integrated functional interaction networks in 15 species
307 bacterial interaction experiments
154796 raw interactions
114786 postprocessed interactions
E. coli integration
[Figure axes: ← Precision ↑, Recall ↓]
Functional maps for functional metagenomics
GOS 4441599.3, Hypersaline Lagoon, Ecuador
[Figure: KEGG Pathways against Organisms, with groups labeled Pathogens and Env.]
Mapping genes into pathways
Mapping pathways into organisms
+ Integrated functional interaction networks in 27 species
Mapping organisms into phyla
=
Functional maps for cross-species knowledge transfer
[Figure axes: ← Precision ↑, Recall ↓]
Following up with unsupervised and partially anchored network alignment
Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
Functional Mapping: Scoring Functional Associations
How can we formalize these relationships?
Any sets of genes G1 and G2 in a network can be compared using four measures:
• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network
$$FA_{G_1,G_2} = \frac{\mathrm{between}(G_1,G_2)}{\mathrm{background}(G_1,G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1,G_2)}$$
Stronger connections between the sets increase association.
Stronger within self-connections or nonspecific background connections decrease association.
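The four measures can be sketched in a few lines. This is an illustration under stated assumptions: each measure is taken to be a mean edge weight, the network is a fully connected weighted graph stored as a dictionary keyed by gene pairs, the two sets are disjoint with at least two genes each, and the score is (between / background) × (baseline / within). The function name and toy network are hypothetical.

```python
from itertools import combinations
from statistics import mean


def functional_association(net, g1, g2):
    """FA score for two disjoint gene sets, using mean edge weights throughout."""
    g1, g2 = set(g1), set(g2)
    union = g1 | g2
    # Edges connecting a gene in one set to a gene in the other.
    between = mean(net[frozenset((a, b))] for a in g1 for b in g2)
    # Edges whose endpoints both lie in the same set.
    within = mean(
        net[frozenset(pair)] for s in (g1, g2) for pair in combinations(sorted(s), 2)
    )
    # Every edge with at least one endpoint in either set.
    background = mean(w for pair, w in net.items() if pair & union)
    # Every edge in the network.
    baseline = mean(net.values())
    return (between / background) * (baseline / within)


# Toy network (assumption): six genes, weak edges everywhere except around {a, b} and {c, d}.
genes = list("abcdef")
net = {frozenset(pair): 0.1 for pair in combinations(genes, 2)}
net[frozenset("ab")] = 0.5
net[frozenset("cd")] = 0.5
for a in "ab":
    for b in "cd":
        net[frozenset((a, b))] = 0.9

related = functional_association(net, "ab", "cd")
unrelated = functional_association(net, "ab", "ef")
```

In this toy network the strongly cross-linked pair of sets scores above 1 and the weakly linked pair scores below 1, which matches the direction described on the slide.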
Functional Mapping: Bootstrap p-values
• Scoring functional associations is great… how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?
Empirically!
[Figure: Histograms of FAs for random sets, for gene set sizes (# genes) 1, 5, 10, and 50]
For any graph, compute FA scores for many randomly chosen gene sets of different sizes. Null distribution is approximately normal with mean 1.
Standard deviation is asymptotic in the sizes of both gene sets.
Maps FA scores to p-values for any gene sets and underlying graph.
[Figure: Null distribution σs for one graph, plotted against |G1| and |G2|]
$$\hat{\mu}_{FA}(G_i, G_j) = 1$$
[Equation: $\hat{\sigma}_{FA}(G_i, G_j)$, a function of $|G_i|$ and $|G_j|$ with terms A, B, and C (garbled in source)]
$$P(FA_{G_1,G_2} \geq x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$$
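The empirical procedure can be sketched as follows: draw many random pairs of gene sets of the sizes of interest, score each pair, fit a normal distribution to the scores, and read the p-value from its upper tail. The FA form used here, the random toy network, and all function names are assumptions for illustration. The sketch estimates the null directly for one pair of sizes and does not fit the parametric standard-deviation curve from the slide.

```python
import random
from itertools import combinations
from statistics import NormalDist, mean, stdev


def fa_score(net, g1, g2):
    """Assumed FA form: (between / background) * (baseline / within)."""
    g1, g2 = set(g1), set(g2)
    union = g1 | g2
    between = mean(net[frozenset((a, b))] for a in g1 for b in g2)
    within = mean(net[frozenset(p)] for s in (g1, g2) for p in combinations(s, 2))
    background = mean(w for pair, w in net.items() if pair & union)
    baseline = mean(net.values())
    return (between / background) * (baseline / within)


def null_distribution(net, genes, n1, n2, trials=200, seed=0):
    """Fit a normal null to FA scores of random disjoint sets of sizes n1 and n2 (each >= 2)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        sample = rng.sample(genes, n1 + n2)
        scores.append(fa_score(net, sample[:n1], sample[n1:]))
    return NormalDist(mean(scores), stdev(scores))


def p_value(null, observed):
    """Upper-tail probability of seeing a score at least this large under the null."""
    return 1.0 - null.cdf(observed)


# Toy graph (assumption): 30 genes with independent random edge weights.
rng = random.Random(1)
genes = [f"g{i}" for i in range(30)]
net = {frozenset(p): rng.uniform(0.05, 1.0) for p in combinations(genes, 2)}

null = null_distribution(net, genes, n1=5, n2=5)
print(p_value(null, 1.5))
```

Repeating `null_distribution` over a grid of set sizes would give the per-size standard deviations that the slide summarizes for one graph.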
Microbial Communities and Functional Metagenomics
• Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
• Pathogen collections of “single” organisms form similar communities
• Another data integration problem
– Must include datasets from multiple organisms
• What questions can we answer?
– What pathways/processes are present/over/under-enriched in a newly sequenced microbe/community?
– What’s shared within community X? What’s different? What’s unique?
– How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
– Current functional methods annotate ~50% of synthetic data, <5% of environmental data
[Figure: gene network: PKH1, PKH2, PKH3, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]
With Jacques Izard, Wendy Garrett
Data Integration for Microbial Communities
[Figure: gene network, shown three times: PKH1, PKH2, PKH3, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]
~350 available expression datasets
~25 species
Weskamp et al 2004
Flannick et al 2006
Kanehisa et al 2008
Tatusov et al 1997
• Data integration works just as well in microbes as it does in yeast and humans
• We know an awful lot about some microorganisms and almost nothing about others
• Sequence-based and network-based tools for function transfer both work in isolation
• We can use data integration to leverage both and mine out additional biology
Functional Maps for Functional Metagenomics
Validating Orthology-Based Functional Mapping
Does unweighted data integration predict functional relationships?
What is the effect of “projecting” through an orthologous space?
[Figure: four panels (KEGG and GO), each plotting log(Precision/Random) against Recall, with curves labeled Unsupervised integration and Individual datasets]
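The axes used in these evaluations can be computed from a ranked list of predictions and a gold standard. A minimal sketch, assuming "Random" means the precision expected from an unranked list (the overall fraction of positives) and using the natural logarithm, since the base is not stated; the function name and toy data are hypothetical.

```python
import math


def log_precision_recall(scores, positives):
    """Return (recall, log(precision / random)) at each cutoff of the ranked list."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    total_pos = sum(1 for pair in ranked if pair in positives)
    random_precision = total_pos / len(ranked)
    curve, hits = [], 0
    for k, pair in enumerate(ranked, start=1):
        if pair in positives:
            hits += 1
        if hits == 0:
            continue  # the log is undefined while precision is zero
        precision = hits / k
        curve.append((hits / total_pos, math.log(precision / random_precision)))
    return curve


# Toy predictions (assumption): four scored gene pairs, two of them true positives.
scores = {"ab": 0.9, "ac": 0.8, "bc": 0.3, "bd": 0.2}
positives = {"ab", "bc"}
curve = log_precision_recall(scores, positives)
```

A curve that stays above zero across the recall range indicates predictions better than a random ordering.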
Validating Orthology-Based Functional Mapping
[Figure: yeast genes YG1 to YG17 partitioned into subsets]
Holdout set, uncharacterized “genome”
Random subsets, characterized “genomes”
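The partitioning step of this validation can be sketched as a random split of one genome into several pseudo-genomes, one of which is held out. The number of subsets, the seed, and the function name are assumptions for illustration; the slide does not state how the subsets were drawn.

```python
import random


def split_genome(genes, n_subsets, seed=0):
    """Shuffle genes and deal them into n_subsets; the first subset is the holdout."""
    rng = random.Random(seed)
    shuffled = list(genes)
    rng.shuffle(shuffled)
    subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
    return subsets[0], subsets[1:]


# Toy genome (assumption): 17 yeast genes named as in the figure.
genes = [f"YG{i}" for i in range(1, 18)]
holdout, characterized = split_genome(genes, n_subsets=4)
```

Predictions for the holdout genes would then be made only from data on the characterized subsets and compared against the holdout's known annotations.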
Validating Orthology-Based Functional Mapping
Can subsets of the yeast genome predict a held-out subset’s functional maps?
Can subsets of the yeast genome predict a held-out subset’s interactome?
[Figure: KEGG and GO panels for each question, annotated with the values 0.68, 0.48, 0.39, 0.25, 0.30, 0.37, 0.27, 0.39, 0.43, 0.40]
What have we learned?
• Yeast is incredibly well-curated
• KEGG tends to be more specific than GO
• Predicting interactomes by projecting through functional maps works decently in the absolute best case